Abstract

We study online learning when individual instances are corrupted by adversarially chosen
random noise. We assume the noise distribution is unknown, and may change over time 
with no restriction other than having zero mean and bounded variance. Our technique relies 
on a family of unbiased estimators for non-linear functions, which may be of independent 
interest. We show that a variant of online gradient descent can learn functions in any dot-product
(e.g., polynomial) or Gaussian kernel space with any analytic convex loss function.
Our variant uses randomized estimates that need to query a random number of noisy copies 
of each instance, where with high probability this number is upper bounded by a constant. 
Allowing such multiple queries cannot be avoided: Indeed, we show that online learning is 
in general impossible when only one noisy copy of each instance can be accessed. 



1 Introduction 

In many machine learning applications training data are typically collected by measuring certain 
physical quantities. Examples include bioinformatics, medical tests, robotics, and remote sensing. 
These measurements have errors that may be due to several reasons: sensor costs, communication 
constraints, or intrinsic physical limitations. In all such cases, the learner trains on a distorted 
version of the actual "target" data, which is where the learner's predictive ability is eventually 
evaluated. In this work we investigate the extent to which a learning algorithm can achieve a good 
predictive performance when training data are corrupted by noise with unknown distribution. 

We prove upper and lower bounds on the learner's cumulative loss in the framework of online 
learning, where examples are generated by an arbitrary and possibly adversarial source. We model
the measurement error via a random perturbation which affects each instance observed by the learner. 
We do not assume any specific property of the noise distribution other than zero-mean and bounded 
variance. Moreover, we allow the noise distribution to change at every step in an adversarial way 
and fully hidden from the learner. Our positive results are quite general: by using a randomized 
unbiased estimate for the loss gradient and a randomized feature mapping to estimate kernel values, 
we show that a variant of online gradient descent can learn functions in any dot-product (e.g., 
polynomial) or Gaussian RKHS under any given analytic convex loss function. Our techniques are 
readily extendable to other kernel types as well. 

In order to obtain unbiased estimates of loss gradients and kernel values, we allow the learner 
to query a random number of independently perturbed copies of the current unseen instance. We 
show how low-variance estimates can be computed using a number of queries that is constant with
high probability. This is in sharp contrast with standard averaging techniques which attempt to
directly estimate the noisy instance, as these require a sample whose size depends on the scale of 
the problem. Finally, we formally show that learning is impossible, even without kernels, when only 
one perturbed copy of each instance can be accessed. This is true for essentially any reasonable loss 
function. 

Our paper is organized as follows. In the next subsection we discuss related work. In Sec. 2 we
introduce our setting and justify some of our choices. In Sec. 4 we present our main results but
before that, in Sec. 3, we discuss the techniques used to obtain them. In the same section, we also
explain why existing techniques are insufficient to deal with our problem. The detailed proofs and
subroutine implementations appear in Sec. 5, with some of the more technical lemmas and proofs
relegated to the appendix. We wrap up with a discussion on possible avenues for future work in
Sec. 6.

1.1 Related Work 

In the machine learning literature, the problem of learning from noisy examples, and, in particular, 
from noisy training instances, has traditionally received a lot of attention — see, for example, the 
recent survey [11]. On the other hand, there are comparatively few theoretically-principled studies on
this topic. Two of them focus on models quite different from the one studied here: random attribute
noise in PAC boolean learning [3, 8], and malicious noise [9, 5]. In the first case, learning is restricted
to classes of boolean functions and the noise must be independent across each boolean coordinate. 
In the second case, an adversary is allowed to perturb a small fraction of the training examples in 
an arbitrary way, making learning impossible in a strong informational sense unless this perturbed 
fraction is very small (of the order of the desired accuracy for the predictor).

The previous work perhaps closest to the one presented here is [10], where binary classification
mistake bounds are proven for the online Winnow algorithm in the presence of attribute errors.
Similarly to our setting, the sequence of instances observed by the learner is chosen by an adversary.
However, in [10] the noise is generated by an adversary, who may change the value of each attribute
in an arbitrary way. The final mistake bound, which only applies when the noiseless data sequence 
is linearly separable without kernels, depends on the sum of all adversarial perturbations. 

2 Setting 

We consider a setting where the goal is to predict values $y \in \mathbb{R}$ based on instances $x \in \mathbb{R}^d$. In
this paper we focus on kernel-based linear predictors of the form $x \mapsto \langle w, \Psi(x)\rangle$, where $\Psi$ is a
feature mapping into some reproducing kernel Hilbert space (RKHS). We assume there exists a
kernel function that efficiently implements dot products in that space, i.e., $k(x,x') = \langle \Psi(x), \Psi(x')\rangle$.
Note that a special case of this setting is linear kernels, where $\Psi(\cdot)$ is the identity mapping and
$k(x,x') = \langle x, x'\rangle$.

The standard online learning protocol for linear prediction with kernels is defined as follows: at
each round $t$, the learner picks a linear hypothesis $w_t$ from the RKHS. The adversary then picks an
example $(x_t, y_t)$ and reveals it to the learner. The loss suffered by the learner is $\ell(\langle w_t, \Psi(x_t)\rangle, y_t)$,
where $\ell$ is a known and fixed loss function. The goal of the learner is to minimize regret with respect
to a fixed convex set of hypotheses $W$, namely

$$\sum_{t=1}^{T} \ell(\langle w_t, \Psi(x_t)\rangle, y_t) - \min_{w \in W} \sum_{t=1}^{T} \ell(\langle w, \Psi(x_t)\rangle, y_t).$$

Typically, we wish to find a strategy for the learner such that, no matter what the adversary's
strategy of choosing the sequence of examples is, the expression above is sub-linear in $T$.

We now make the following twist, which limits the information available to the learner: instead
of receiving $(x_t, y_t)$, the learner observes $y_t$ and is given access to an oracle $A_t$. On each call, $A_t$
returns an independent copy of $x_t + Z_t$, where $Z_t$ is a zero-mean random vector with some known
finite bound on its variance (in the sense that $\mathbb{E}[\|Z_t\|^2] \le a$ for some uniform constant $a$). In
general, the distribution of $Z_t$ is unknown to the learner. It might be chosen by the adversary, and
change from round to round or even between consecutive calls to $A_t$. Note that here we assume that
$y_t$ remains unperturbed, but we emphasize that this is just for simplicity - our techniques can be
readily extended to deal with noisy values as well.
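The oracle access described above can be mimicked in a few lines of Python. This is only an illustrative sketch: the class name `NoisyOracle`, the helper `average_queries`, and the choice of i.i.d. Gaussian noise are assumptions made for the example, whereas the setting itself allows any zero-mean noise with bounded variance, possibly chosen adversarially.

```python
import random


class NoisyOracle:
    """Hypothetical stand-in for the oracle A_t: each call returns x_t + Z_t."""

    def __init__(self, x, noise_std, rng):
        self.x = list(x)            # hidden, unperturbed instance x_t
        self.noise_std = noise_std  # assumption: i.i.d. Gaussian coordinates
        self.rng = rng
        self.calls = 0              # number of queries made so far

    def query(self):
        """Return one independently perturbed copy of x_t."""
        self.calls += 1
        return [xi + self.rng.gauss(0.0, self.noise_std) for xi in self.x]


def average_queries(oracle, m):
    """Naive averaging of m noisy copies (the baseline the text contrasts with)."""
    total = [0.0] * len(oracle.x)
    for _ in range(m):
        copy = oracle.query()
        total = [s + c for s, c in zip(total, copy)]
    return [s / m for s in total]
```

Averaging recovers $x_t$ only as the number of queries grows, which is the dependence on the scale of the problem that the algorithms in this paper are designed to avoid.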

The learner may call $A_t$ more than once. In fact, as we discuss later on, being able to call $A_t$
more than once is necessary for the learner to have any hope to succeed. On the other hand, if the
learner calls $A_t$ an unlimited number of times, it can reconstruct $x_t$ arbitrarily well by averaging,
and we are back to the standard learning setting. In this paper we focus on learning algorithms
that call $A_t$ only a small, essentially constant number of times, which depends only on our choice
of loss function and kernel (rather than $T$, the norm of $x_t$, or the variance of $Z_t$, which will happen
with naive averaging techniques). Moreover, since the number of queries is bounded with very high 
probability, one can even produce an algorithm with an absolute bound on the number of queries, 
which will fail or introduce some bias with an arbitrarily small probability. For simplicity, we ignore 
these issues in this paper. 

In this setting, we wish to minimize the regret in hindsight with respect to the unperturbed data
and averaged over the noise introduced by the oracle, namely

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(\langle w_t, \Psi(x_t)\rangle, y_t)\right] - \min_{w \in W} \sum_{t=1}^{T} \ell(\langle w, \Psi(x_t)\rangle, y_t) \qquad (1)$$



where the random quantities are the predictors $w_1, w_2, \ldots$ generated by the learner, which depend
on the observed noisy instances (in the appendix, we briefly discuss alternative regret measures,
and why they are unsatisfactory). This kind of regret is relevant where we actually wish to learn
from data, without the noise causing a hindrance. In particular, consider the batch setting, where
the examples $\{(x_t, y_t)\}_{t=1}^{T}$ are actually sampled i.i.d. from some unknown distribution, and we wish
to find a predictor which minimizes the expected loss $\mathbb{E}[\ell(\langle w, x\rangle, y)]$ with respect to new examples
$(x, y)$. Using standard online-to-batch conversion techniques, if we can find an online algorithm
with a sublinear bound on Eq. (1), then it is possible to construct learning algorithms for the batch
setting which are robust to noise. That is, algorithms generating a predictor $w$ with close to minimal
expected loss $\mathbb{E}[\ell(\langle w, x\rangle, y)]$ among all $w \in W$.

While our techniques are quite general, the exact algorithmic and theoretical results depend a
lot on which loss function and kernel are used. Discussing the loss function first, we will assume that
$\ell(\langle w, \Psi(x)\rangle, y)$ is a convex function of $w$ for each example $(x, y)$. Somewhat abusing notation, we
assume the loss can be written either as $\ell(\langle w, \Psi(x)\rangle, y) = f(y\langle w, \Psi(x)\rangle)$ or as
$\ell(\langle w, \Psi(x)\rangle, y) = f(\langle w, \Psi(x)\rangle - y)$ for some function $f$. We refer to the first type as classification losses, as it
encompasses most reasonable losses for classification, where $y \in \{-1, +1\}$ and the goal is to predict
the label. We refer to the second type as regression losses, as it encompasses most reasonable 
regression losses, where y takes arbitrary real values. For simplicity, we present some of our results 
in terms of classification losses, but they all hold for regression losses as well with slight modifications. 

We present our results under the assumption that the loss function is "smooth", in the sense
that $\ell'(a)$ can be written as $\sum_{n=0}^{\infty} \gamma_n a^n$ for any $a$ in its domain. This assumption holds for instance
for the squared loss $\ell(a) = a^2$, the exponential loss $\ell(a) = \exp(a)$, and smoothed versions of loss
functions such as the hinge loss and the absolute loss (we discuss examples in more detail in
Subsection 4.2). This assumption can be relaxed under certain conditions, and this is further discussed
in Subsection 3.2.

Turning to the issue of kernels, we note that the general presentation of our approach is somewhat 
hampered by the fact that it needs to be tailored to the kernel we use. In this paper, we focus on 
two families of kernels: 

Dot Product Kernels: the kernel $k(x,x')$ can be written as a function of $\langle x, x'\rangle$. Examples of such
kernels $k(x,x')$ are linear kernels $\langle x, x'\rangle$; homogeneous polynomial kernels $(\langle x, x'\rangle)^n$; inhomogeneous
polynomial kernels $(1 + \langle x, x'\rangle)^n$; exponential kernels $e^{\langle x, x'\rangle}$; binomial kernels $(1 + \langle x, x'\rangle)^{-\alpha}$, and
more (see for instance [T^ITC] ). 

Gaussian Kernels: $k(x,x') = e^{-\|x - x'\|^2/\sigma^2}$ for some $\sigma^2 > 0$.

Again, we emphasize that our techniques are extendable to other kernel types as well. 

3 Techniques 

Our results are based on two key ideas: the use of online gradient descent algorithms, and construction
of unbiased gradient estimators in the kernel setting. The latter is based on a general method
to build unbiased estimators for non-linear functions, which may be of independent interest. 

3.1 Online Gradient Descent 

There exist well developed theory and algorithms for dealing with the standard online learning 
setting, where the example (xt, yt) is revealed after each round, and for general convex loss functions. 
One of the simplest and most well known ones is the online gradient descent algorithm due to 
Zinkevich [17]. Since this algorithm forms a basis for our algorithm in the new setting, we briefly
review it below (as adapted to our setting). 

The algorithm initializes the classifier $w_1 = 0$. At round $t$, the algorithm predicts according to
$w_t$, and updates the learning rule according to $w_{t+1} = P(w_t - \eta_t \nabla_t)$, where $\eta_t$ is a suitably chosen
constant which might depend on $t$; $\nabla_t = \ell'(y_t\langle w_t, \Psi(x_t)\rangle)\, y_t \Psi(x_t)$ is the gradient of $\ell(y_t\langle w, \Psi(x_t)\rangle)$
with respect to $w_t$; and $P$ is a projection operator on the convex set $W$, on whose elements we wish
to achieve low regret. In particular, if we wish to compete with hypotheses of bounded squared
norm $B_w$, $P$ simply involves rescaling the norm of the predictor so as to have squared norm at most
$B_w$. With this algorithm, one can prove regret bounds with respect to any $w \in W$.
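As an illustration of the update just described, the following Python sketch runs projected online gradient descent for a classification loss in the linear case $\Psi(x) = x$. The function names, the step-size schedule $\eta_t = \eta/\sqrt{t}$, and the use of plain lists are assumptions of this sketch; the text only requires $\eta_t$ to be suitably chosen.

```python
import math


def project_to_ball(w, B_w):
    """Rescale w so that its squared norm is at most B_w."""
    sq_norm = sum(v * v for v in w)
    if sq_norm <= B_w:
        return list(w)
    scale = math.sqrt(B_w / sq_norm)
    return [scale * v for v in w]


def ogd(examples, loss_grad, eta, B_w):
    """Projected online gradient descent for a loss of the form l(y <w, x>).

    examples  : list of (x, y) pairs, x a list of floats, y in {-1, +1}
    loss_grad : the scalar derivative l'(a)
    Returns the predictors w_1, ..., w_T used at each round and the final w_{T+1}.
    """
    w = [0.0] * len(examples[0][0])          # w_1 = 0
    predictors = []
    for t, (x, y) in enumerate(examples, start=1):
        predictors.append(list(w))           # predict with w_t
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        g = loss_grad(margin) * y            # gradient is g * x
        step = eta / math.sqrt(t)            # assumed schedule eta_t = eta / sqrt(t)
        w = project_to_ball([wi - step * g * xi for wi, xi in zip(w, x)], B_w)
    return predictors, w
```

Replacing `g * x` by any random vector with the same expectation and bounded second moment leaves the structure of the loop unchanged, which is the property exploited in the remainder of this section.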

A "folklore" result about this algorithm is that in fact, we do not need to update the predictor 
by the gradient at each step. Instead, it is enough to update by some random vector of bounded 
variance, which merely equals the gradient in expectation. This is a useful property in settings 
where $(x_t, y_t)$ is not revealed to the learner, and has been used before, such as in the online bandit
setting (see for instance [B1I71II]). Here, we will use this property in a new way, in order to devise
algorithms which are robust to noise. When the kernel and loss function are linear (e.g., $\Psi(x) = x$
and $\ell(a) = ca + b$ for some constants $b, c$), this property already ensures that the algorithm is robust
to noise without any further changes. This is because the noise injected to each $x_t$ merely causes the
exact gradient estimate to change to a random vector which is correct in expectation: If we assume
$\ell$ is a classification loss, then

$$\mathbb{E}_t[\tilde{\nabla}_t] = \mathbb{E}_t[c\, y_t \tilde{x}_t] = c\, y_t x_t = \nabla_t,$$

where $\tilde{x}_t$ denotes the noisy copy of $x_t$ returned by the oracle and $\tilde{\nabla}_t$ the gradient computed from it.

On the other hand, when we use nonlinear kernels and nonlinear loss functions, using standard 
online gradient descent leads to systematic and unknown biases (since the noise distribution is 
unknown), which prevents the method from working properly. To deal with this problem, we now 
turn to describe a technique for estimating expressions such as $\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$ in an unbiased
manner. In Subsection 3.3 we discuss how $\Psi(x_t)$ can be estimated in an unbiased manner.



3.2 Unbiased Estimators for Non-Linear Functions 

Suppose that we are given access to independent copies of a real random variable $X$, with expectation
$\mathbb{E}[X]$, and some real function $f$, and we wish to construct an unbiased estimate of $f(\mathbb{E}[X])$. If
$f$ is a linear function, then this is easy: just sample $x$ from $X$, and return $f(x)$. By linearity,
$\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ and we are done. The problem becomes less trivial when $f$ is a general,
non-linear function, since usually $\mathbb{E}[f(X)] \neq f(\mathbb{E}[X])$. In fact, when $X$ takes finitely many values and $f$ is
not a polynomial function, one can prove that no unbiased estimator can exist (see [13], Proposition 8
and its proof). Nevertheless, we show how in many cases one can construct an unbiased estimator of
$f(\mathbb{E}[X])$, including cases covered by the impossibility result. There is no contradiction, because we
do not construct a "standard" estimator. Usually, an estimator is a function from a given sample to 
the range of the parameter we wish to estimate. An implicit assumption is that the size of the sample 
given to it is fixed, and this is also a crucial ingredient in the impossibility result. We circumvent 
this by constructing an estimator based on a random number of samples. 

Here is the key idea: suppose $f : \mathbb{R} \to \mathbb{R}$ is any function continuous on a bounded interval.
It is well known that one can construct a sequence of polynomials $(Q_n(\cdot))_{n=1}^{\infty}$, where $Q_n(\cdot)$ is a
polynomial of degree $n$, which converges uniformly to $f$ on the interval. If $Q_n(x) = \sum_{i=0}^{n} \gamma_{n,i} x^i$, let
$Q'_n(x_1, \ldots, x_n) = \sum_{i=0}^{n} \gamma_{n,i} \prod_{j=1}^{i} x_j$. Now, consider the estimator which draws a positive integer $N$
according to some distribution $\mathbb{P}(N = n) = p_n$, samples $X$ for $N$ times to get $x_1, x_2, \ldots, x_N$, and
returns $\frac{1}{p_N}\big(Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1})\big)$, where we assume $Q'_0 = 0$. The expected value
of this estimator is equal to:

$$\mathbb{E}_{N, x_1, \ldots, x_N}\left[\frac{1}{p_N}\big(Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1})\big)\right]$$
$$= \sum_{n=1}^{\infty} \frac{p_n}{p_n}\, \mathbb{E}_{x_1, \ldots, x_n}\big[Q'_n(x_1, \ldots, x_n) - Q'_{n-1}(x_1, \ldots, x_{n-1})\big]$$
$$= \sum_{n=1}^{\infty} \big(Q_n(\mathbb{E}[X]) - Q_{n-1}(\mathbb{E}[X])\big) = f(\mathbb{E}[X]).$$

Thus, we have an unbiased estimator of $f(\mathbb{E}[X])$.

This technique appeared in a rather obscure early 1960's paper [15] from sequential estimation
theory, and appears to be little known, particularly outside the sequential estimation community. 
However, we believe this technique is interesting, and expect it to have useful applications for other 
problems as well. 

While this may seem at first like a very general result, the variance of this estimator must be 
bounded for it to be useful. Unfortunately, this is not true for general continuous functions. More 
precisely, let $N$ be distributed according to $p_n$, and let $\theta$ be the value returned by the estimator. In
[2], it is shown that if $X$ is a Bernoulli random variable, and if $\mathbb{E}[\theta N^k] < \infty$ for some integer $k \ge 1$,
then $f$ must be $k$ times continuously differentiable. Since $\mathbb{E}[\theta N^k] \le (\mathbb{E}[\theta^2] + \mathbb{E}[N^{2k}])/2$, this means
that functions $f$ which yield an estimator with finite variance, while using a number of queries with
bounded variance, must be continuously differentiable. Moreover, in case we desire the number of
queries to be essentially constant (i.e. choose a distribution for $N$ with exponentially decaying tails),
we must have $\mathbb{E}[N^k] < \infty$ for all $k$, which means that $f$ should be infinitely differentiable (in fact,
in [2] it is conjectured that $f$ must be analytic in such cases).

Thus, we focus in this paper on functions $f$ which are analytic, i.e., they can be written as
$f(x) = \sum_{i=0}^{\infty} \gamma_i x^i$ for appropriate constants $\gamma_0, \gamma_1, \ldots$. In that case, $Q_n$ can simply be the truncated
Taylor expansion of $f$ to order $n$, i.e., $Q_n(x) = \sum_{i=0}^{n} \gamma_i x^i$. Moreover, we can pick $p_n \propto 1/p^n$ for
any $p > 1$. So the estimator becomes the following: we sample a nonnegative integer $N$ according
to $\mathbb{P}(N = n) = (p-1)/p^{n+1}$, sample $X$ independently $N$ times to get $x_1, x_2, \ldots, x_N$, and return
$\theta = \gamma_N \frac{p^{N+1}}{p-1}\, x_1 x_2 \cdots x_N$, where we set $\theta = \frac{p}{p-1}\gamma_0$ if $N = 0$.¹ We have the following:



Lemma 1. For the above estimator, it holds that $\mathbb{E}[\theta] = f(\mathbb{E}[X])$. The expected number of samples
used by the estimator is $1/(p-1)$, and the probability of it being at least $z$ is $p^{-z}$. Moreover, if we
assume that $f_+(x) = \sum_{n=0}^{\infty} |\gamma_n| x^n$ exists for any $x$ in the domain of interest, then

$$\mathbb{E}[\theta^2] \le \frac{p}{p-1}\, f_+^2\Big(\sqrt{p\,\mathbb{E}[X^2]}\Big).$$



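Proof. The fact that $\mathbb{E}[\theta] = f(\mathbb{E}[X])$ follows from the discussion above. The results about the
number of samples follow directly from properties of the geometric distribution. As for the second
moment, $\mathbb{E}[\theta^2]$ equals

$$\sum_{n=0}^{\infty} \frac{p-1}{p^{n+1}}\, \gamma_n^2\, \frac{p^{2(n+1)}}{(p-1)^2}\, \big(\mathbb{E}[X^2]\big)^n = \frac{p}{p-1} \sum_{n=0}^{\infty} \Big(|\gamma_n| \big(\sqrt{p\,\mathbb{E}[X^2]}\big)^n\Big)^2 \le \frac{p}{p-1}\, f_+^2\Big(\sqrt{p\,\mathbb{E}[X^2]}\Big). \qquad \Box$$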



The parameter $p$ provides a tradeoff between the variance of the estimator and the number of
samples needed: the larger $p$ is, the fewer samples we need, but the estimator has more variance.
In any case, the sample size distribution decays exponentially fast, so the sample size is essentially
bounded.
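The estimator of Lemma 1 is short enough to write out. The Python sketch below follows the definition in the text for an analytic $f$ given through its Taylor coefficients; the function names (`sample_N`, `theta`, `estimate`, `exp_coeff`) and the choice $f = \exp$ for the example coefficients are assumptions made for illustration.

```python
import math
import random


def sample_N(p, rng):
    """Draw N with P(N = n) = (p - 1) / p**(n + 1) for n = 0, 1, 2, ..."""
    n = 0
    while rng.random() < 1.0 / p:   # continue with probability 1/p
        n += 1
    return n


def theta(n, samples, gamma, p):
    """Estimator value gamma_n * p**(n+1) / (p-1) * x_1 * ... * x_n for a given n."""
    prod = 1.0
    for x in samples[:n]:
        prod *= x
    return gamma(n) * p ** (n + 1) / (p - 1) * prod


def estimate(draw_x, gamma, p, rng):
    """One unbiased estimate of f(E[X]) from a random number of samples of X."""
    n = sample_N(p, rng)
    return theta(n, [draw_x() for _ in range(n)], gamma, p)


def exp_coeff(n):
    """Taylor coefficients of f(x) = exp(x) around 0 (example choice of f)."""
    return 1.0 / math.factorial(n)
```

For instance, `estimate(lambda: 0.7 + rng.gauss(0.0, 0.1), exp_coeff, 2.0, rng)` returns a single noisy value whose expectation is $\exp(0.7)$, although any individual value may be far from it; the second-moment bound of Lemma 1 controls how far.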

It should be emphasized that the estimator associated with Lemma 1 is tailored for generality, and
is suboptimal in some cases. For example, if $f$ is a polynomial function, then $\gamma_n = 0$ for sufficiently
large n, and there is no reason to sample N from a distribution supported on all nonnegative integers 
- it just increases the variance. Nevertheless, in order to keep the presentation unified and general, 
we will always use this type of estimator. If needed, the estimator can always be optimized for 
specific cases. 

We also note that this technique can be improved in various directions, if more is known about 
the distribution of X. For instance, if we have some estimate of the expectation and variance of X, 
then we can perform a Taylor expansion around the estimated $\mathbb{E}[X]$ rather than 0, and tune the
probability distribution of N to be different than the one we used above. These modifications can 
allow us to make the variance of the estimator arbitrarily small, if the variance of X is small enough. 
Moreover, one can take polynomial approximations to / which are perhaps better than truncated 
Taylor expansions. In this paper, for simplicity, we will ignore these potential improvements. 

Finally, we note that a related result in [5] implies that it is impossible to estimate $f(\mathbb{E}[X])$ in an
unbiased manner when $f$ is discontinuous, even if we allow a number of queries and estimator values
which are infinite in expectation. Therefore, since the derivative of the hinge loss is not continuous, 
estimating in an unbiased manner the gradient of the hinge loss with arbitrary noise appears to be 
impossible. Thus, if online learning with noise and hinge loss is at all feasible, a rather different 
approach than ours will need to be taken. 

3.3 Unbiasing Noise in the RKHS 

The third component of our approach involves the unbiased estimation of $\Psi(x_t)$, when we only
have unbiased noisy copies of $x_t$. Here again, we have a non-trivial problem, because the feature
mapping $\Psi$ is usually highly non-linear, so $\mathbb{E}[\Psi(\tilde{x}_t)] \neq \Psi(\mathbb{E}[\tilde{x}_t])$ in general. Moreover, $\Psi$ is not a
scalar function, so the technique of Subsection 3.2 will not work as-is.



¹ Admittedly, the event $N = 0$ should receive zero probability, as it amounts to "skipping" the sampling
altogether. However, setting $\mathbb{P}(N = 0) = 0$ appears to improve the bound in this paper only in the smaller
order terms, while making the analysis in the paper more complicated.



To tackle this problem, we construct an explicit feature mapping, which needs to be tailored to
the kernel we want to use. To give a very simple example, suppose we use the homogeneous 2nd-degree
polynomial kernel, $k(r, s) = (\langle r, s\rangle)^2$. It is not hard to verify that the function $\Psi : \mathbb{R}^d \to \mathbb{R}^{d^2}$,
defined via $\Psi(x) = (x_1 x_1, x_1 x_2, \ldots, x_d x_d)$, is an explicit feature mapping for this kernel. Now, if we
query two independent noisy copies $\tilde{x}, \tilde{x}'$ of $x$, we have that the expectation of the random vector
$(\tilde{x}_1 \tilde{x}'_1, \tilde{x}_1 \tilde{x}'_2, \ldots, \tilde{x}_d \tilde{x}'_d)$ is nothing more than $\Psi(x)$. Thus, we can construct unbiased estimates of
$\Psi(x)$ in the RKHS. Of course, this example pertains to a very simple RKHS with a finite dimensional
representation. By a randomization trick somewhat similar to the one in Subsection 3.2, we can
adapt this approach to infinite dimensional RKHS as well. In a nutshell, we represent $\Psi(x)$ as an
infinite-dimensional vector, and its noisy unbiased estimate is a vector which is non-zero on only
finitely many entries, using finitely many noisy queries. Moreover, inner products between these
estimates can be done efficiently, allowing us to implement the learning algorithms, and use the 
resulting predictor on test instances. 
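The 2nd-degree polynomial example above can be checked directly. In the Python sketch below, `map_estimate_poly2` is a hypothetical name for the estimate built from two noisy copies; it is not the paper's general Map_Estimate subroutine, which is described in the proofs. The explicit $d^2$-dimensional representation is used here only because $d$ is tiny.

```python
def outer_flat(a, b):
    """Flattened outer product (a_1 b_1, a_1 b_2, ..., a_d b_d)."""
    return [ai * bj for ai in a for bj in b]


def feature_map_poly2(x):
    """Explicit feature map of the kernel k(r, s) = <r, s>**2."""
    return outer_flat(x, x)


def map_estimate_poly2(copy1, copy2):
    """Estimate of Psi(x) from two independent noisy copies of x.

    Unbiased because the two copies carry independent zero-mean noise,
    so the cross terms vanish in expectation.
    """
    return outer_flat(copy1, copy2)


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))
```

Using the same noisy copy twice, `outer_flat(copy, copy)`, would instead add the noise variance to the diagonal entries, which is exactly the systematic bias that a second independent query removes.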

4 Main Results 

4.1 Algorithm 

We present our algorithmic approach in a modular form. We start by introducing the main algorithm, 
which contains several subroutines. Then we prove our two main results, which bound the regret of 
the algorithm, the number of queries to the oracle, and the running time for two types of kernels: 
dot product and Gaussian (our results can be extended to other kernel types as well). In itself, the 
algorithm is nothing more than a standard online gradient descent algorithm with a standard $O(\sqrt{T})$
regret bound. Thus, most of the proofs are devoted to a detailed discussion of how the subroutines
are implemented (including explicit pseudo-code). In this section, we just describe one subroutine,
based on the techniques discussed in Sec. 3. The other subroutines require a more detailed and
technical discussion, and thus their implementation is described as part of the proofs in Sec. 5. In
any case, the intuition behind the implementations and the techniques used are described in Sec. 3.

For simplicity, we will focus on a finite-horizon setting, where the number of online rounds T 
is fixed and known to the learner. The algorithm can easily be modified to deal with the infinite 
horizon setting, where the learner needs to achieve sub-linear regret for all T simultaneously. Also, 
for the remainder of this subsection, we assume for simplicity that $\ell$ is a classification loss, namely
it can be written as $\ell(y\langle w, \Psi(x)\rangle)$. It is not hard to adapt the results below to the case
where $\ell$ is a regression loss (where $\ell$ is a function of $\langle w, \Psi(x)\rangle - y$).

We note that at each round, the algorithm below constructs an object which we denote as $\tilde{\Psi}(x_t)$.
This object has two interpretations here: formally, it is an element of a reproducing kernel Hilbert
space (RKHS) corresponding to the kernel we use, and is equal in expectation to $\Psi(x_t)$. However,
in terms of implementation, it is simply a data structure consisting of a finite set of vectors from
$\mathbb{R}^d$. Thus, it can be efficiently stored in memory and handled even for infinite-dimensional RKHS.

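Algorithm 1 Kernel Learning Algorithm with Noisy Input

Parameters: Learning rate $\eta > 0$, number of rounds $T$, sample parameter $p > 1$.
Initialize:
    $\alpha_i = 0$ for all $i = 1, \ldots, T$.
    $\tilde{\Psi}(x_i)$ for all $i = 1, \ldots, T$
    // $\tilde{\Psi}(x_i)$ is a data structure which can store a variable number of vectors in $\mathbb{R}^d$
    // $\alpha_{t,i}$ denotes the value of $\alpha_i$ at the end of round $t$
For $t = 1 \ldots T$
    Define $w_t = \sum_{i=1}^{t-1} \alpha_{t-1,i} \tilde{\Psi}(x_i)$
    Receive $A_t$, $y_t$ // The oracle $A_t$ provides noisy estimates of $x_t$
    Let $\tilde{\Psi}(x_t)$ := Map_Estimate($A_t$, $p$) // Get unbiased estimate of $\Psi(x_t)$ in the RKHS
    Let $\tilde{g}_t$ := Grad_Length_Estimate($A_t$, $y_t$, $p$) // Get unbiased estimate of $\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$
    Let $\alpha_t := -\tilde{g}_t\, \eta/\sqrt{T}$ // Perform gradient step
    Let $\tilde{n}_t := \sum_{i=1}^{t} \sum_{j=1}^{t} \alpha_{t,i}\, \alpha_{t,j}\, \mathrm{Prod}(\tilde{\Psi}(x_i), \tilde{\Psi}(x_j))$
    // Compute squared norm, where $\mathrm{Prod}(\tilde{\Psi}(x_i), \tilde{\Psi}(x_j))$ returns $\langle \tilde{\Psi}(x_i), \tilde{\Psi}(x_j)\rangle$
    If $\tilde{n}_t > B_w$ // If norm squared is larger than $B_w$, then project
        Let $\alpha_i := \alpha_i \sqrt{B_w/\tilde{n}_t}$ for all $i = 1, \ldots, t$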

Like $\tilde{\Psi}(x_t)$, $w_{t+1}$ also has two interpretations: formally, it is an element in the RKHS, as defined
in the pseudocode. In terms of implementation, it is defined via the data structures $\tilde{\Psi}(x_1), \ldots, \tilde{\Psi}(x_t)$
and the values of $\alpha_1, \ldots, \alpha_t$ at round $t$. To apply this hypothesis on a given instance $x'$, we compute
$\sum_{i=1}^{t} \alpha_{t,i}\, \mathrm{Prod}(\tilde{\Psi}(x_i), x')$, where $\mathrm{Prod}(\tilde{\Psi}(x_i), x')$ is a subroutine which returns $\langle \tilde{\Psi}(x_i), \Psi(x')\rangle$
(pseudocode is provided as part of the proofs later on).

We now turn to the main results pertaining to the algorithm. The first result shows what regret
bound is achievable by the algorithm for any dot-product kernel, and characterizes the number
of oracle queries per instance and the overall running time of the algorithm.

Theorem 1. Assume that the loss function $\ell$ has an analytic derivative $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$ for all
$a$ in its domain, and let $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$ (assuming it exists). Assume also that the kernel
$k(x,x')$ can be written as $Q(\langle x, x'\rangle)$ for all $x, x' \in \mathbb{R}^d$. Finally, assume that $\mathbb{E}[\|\tilde{x}_t\|^2] \le B_{\tilde{x}}$ for any
$\tilde{x}_t$ returned by the oracle at round $t$, for all $t = 1, \ldots, T$. Then, for all $B_w > 0$ and $p > 1$, it is
possible to implement the subroutines of Algorithm 1 such that:

• The expected number of queries to each oracle $A_t$ is $\frac{p}{(p-1)^2}$.

• The expected running time of the algorithm is $O\Big(T^3\Big(1 + \frac{dp}{(p-1)^2}\Big)\Big)$.

• If we run Algorithm 1 with $\eta = B_w / \big(\sqrt{u}\, \ell'_+(\sqrt{(p-1)u})\big)$, where $u = B_w \big(\frac{p}{p-1}\big)^2 Q(p B_{\tilde{x}})$, then

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(y_t\langle w_t, \Psi(x_t)\rangle)\right] - \min_{w : \|w\|^2 \le B_w} \sum_{t=1}^{T} \ell(y_t\langle w, \Psi(x_t)\rangle) \le \ell'_+\big(\sqrt{(p-1)u}\big)\sqrt{uT}.$$

The expectations are with respect to the randomness of the oracles and the algorithm throughout its
run.

We note that the distribution of the number of oracle queries can be specified explicitly, and 
it decays very rapidly - see the proof for details. Also, for simplicity, we only bound the expected 
regret in the theorem above. If the noise is bounded almost surely or with sub-Gaussian tails (rather 
than just bounded variance), then it is possible to obtain similar guarantees with high probability, 
by relying on Azuma's inequality or variants thereof (see for example [1]). 

We now turn to the case of Gaussian kernels. 

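Theorem 2. Assume that the loss function $\ell$ has an analytic derivative $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$ for all
$a$ in its domain, and let $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$ (assuming it exists). Assume that the kernel $k(x,x')$
is defined as $\exp(-\|x - x'\|^2/\sigma^2)$. Finally, assume that $\mathbb{E}[\|\tilde{x}_t\|^2] \le B_{\tilde{x}}$ for any $\tilde{x}_t$ returned by the
oracle at round $t$, for all $t = 1, \ldots, T$. Then for all $B_w > 0$ and $p > 1$ it is possible to implement
the subroutines of Algorithm 1 such that

• The expected number of queries to each oracle $A_t$ is $\frac{3p}{(p-1)^2}$.

• The expected running time of the algorithm is $O\Big(T^3\Big(1 + \frac{dp}{(p-1)^2}\Big)\Big)$.

• If we run Algorithm 1 with $\eta = B_w / \big(\sqrt{u}\, \ell'_+(\sqrt{(p-1)u})\big)$, where

$$u = B_w \Big(\frac{p}{p-1}\Big)^3 \exp\left(\frac{\sqrt{p}\, B_{\tilde{x}} + 2p\sqrt{B_{\tilde{x}}}}{\sigma^2}\right),$$

then

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(y_t\langle w_t, \Psi(x_t)\rangle)\right] - \min_{w : \|w\|^2 \le B_w} \sum_{t=1}^{T} \ell(y_t\langle w, \Psi(x_t)\rangle) \le \ell'_+\big(\sqrt{(p-1)u}\big)\sqrt{uT}.$$

The expectations are with respect to the randomness of the oracles and the algorithm throughout its
run.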

As in Thm. 1, note that the number of oracle queries has a fast decaying distribution. Also, note
that with Gaussian kernels, $\sigma^2$ is usually chosen to be on the order of the example's squared norms.
Thus, if the noise added to the examples is proportional to their original norm, we can assume that
$B_{\tilde{x}}/\sigma^2 = O(1)$, and thus $u$ which appears in the bound is also bounded by a constant.

As previously mentioned, most of the subroutines are described in the proofs section, as part
of the proof of Thm. 1. Here, we only show how to implement the Grad_Length_Estimate subroutine,
which returns the gradient length estimate $\tilde{g}_t$. The idea is based on the technique described in
Subsection 3.2. We prove that $\tilde{g}_t$ is an unbiased estimate of $\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$, and bound $\mathbb{E}[\tilde{g}_t^2]$. As
discussed earlier, we assume that $\ell'(\cdot)$ is analytic and can be written as $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$.

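Subroutine 1 Grad_Length_Estimate($A_t$, $y_t$, $p$)

Sample nonnegative integer $n$ according to $\mathbb{P}(n) = (p-1)/p^{n+1}$
For $j = 1, \ldots, n$
    Let $\tilde{\Psi}(x_t)_j$ := Map_Estimate($A_t$) // Get unbiased estimate of $\Psi(x_t)$ in the RKHS
Return $\tilde{g}_t := y_t \gamma_n \frac{p^{n+1}}{p-1} \prod_{j=1}^{n} \Big(\sum_{i=1}^{t-1} \alpha_{t-1,i}\, \mathrm{Prod}(\tilde{\Psi}(x_i), \tilde{\Psi}(x_t)_j)\Big)$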
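To make the subroutine concrete, here is a Python sketch for the simplest case of a linear kernel, where $w_t$ is stored explicitly and each Map_Estimate call is just one noisy copy of $x_t$. Several choices are assumptions of this sketch rather than statements about the paper's implementation: the function names, the explicit representation of $w_t$, and the multiplication of each inner-product factor by $y_t$, which is done here so that the expectation equals $y_t\,\ell'(y_t\langle w_t, x_t\rangle)$ as stated in Lemma 2.

```python
import math
import random


def sample_n(p, rng):
    """Draw n with P(n) = (p - 1) / p**(n + 1) for n = 0, 1, 2, ..."""
    n = 0
    while rng.random() < 1.0 / p:
        n += 1
    return n


def grad_length_given_n(n, copies, w, y, gamma, p):
    """Value returned for a fixed n, from n noisy copies of x_t (linear kernel)."""
    prod = 1.0
    for x_tilde in copies[:n]:
        # each factor estimates y * <w, x_t>; the factor y is this sketch's assumption
        prod *= y * sum(wi * xi for wi, xi in zip(w, x_tilde))
    return y * gamma(n) * p ** (n + 1) / (p - 1) * prod


def grad_length_estimate(query, w, y, gamma, p, rng):
    """Randomized estimate: draw n, query the oracle n times, combine."""
    n = sample_n(p, rng)
    return grad_length_given_n(n, [query() for _ in range(n)], w, y, gamma, p)


def exp_loss_coeff(n):
    """Taylor coefficients of l'(a) = exp(a), i.e. the exponential loss."""
    return 1.0 / math.factorial(n)
```

Here `query` is any zero-argument callable returning one noisy copy of $x_t$, for example a bound method of an oracle object.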



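Lemma 2. Assume that $\mathbb{E}[\tilde{\Psi}(x_t)] = \Psi(x_t)$, and that $\mathrm{Prod}(\tilde{\Psi}(x), \tilde{\Psi}(x'))$ returns $\langle \tilde{\Psi}(x), \tilde{\Psi}(x')\rangle$ for
all $x, x'$. Then for any given $w_t = \alpha_{t-1,1}\tilde{\Psi}(x_1) + \cdots + \alpha_{t-1,t-1}\tilde{\Psi}(x_{t-1})$ it holds that

$$\mathbb{E}_t[\tilde{g}_t] = y_t\, \ell'(y_t\langle w_t, \Psi(x_t)\rangle) \quad \text{and} \quad \mathbb{E}_t[\tilde{g}_t^2] \le \frac{p}{p-1}\, \ell'_+\Big(\sqrt{p B_w B_{\tilde{\Psi}}}\Big)^2,$$

where the expectation is with respect to the randomness of Subroutine 1, and $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$.

Proof. The result follows from Lemma 1, where $\tilde{g}_t$ corresponds to the estimator $\theta$, the function $f$
corresponds to $\ell'$, and the random variable $X$ corresponds to $\langle w_t, \tilde{\Psi}(x_t)\rangle$ (where $\tilde{\Psi}(x_t)$ is random
and $w_t$ is held fixed). The term $\mathbb{E}[X^2]$ in Lemma 1 can be upper bounded as

$$\mathbb{E}_t\Big[\big(\langle w_t, \tilde{\Psi}(x_t)\rangle\big)^2\Big] \le \|w_t\|^2\, \mathbb{E}_t\big[\|\tilde{\Psi}(x_t)\|^2\big] \le B_w B_{\tilde{\Psi}}. \qquad \Box$$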



4.2 Loss Function Examples 

Theorems 1 and 2 both deal with generic loss functions $\ell$ whose derivative can be written as
$\sum_{n=0}^{\infty} \gamma_n a^n$, and the regret bounds involve the functions $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$. Below, we present a
few examples of loss functions and their corresponding $\ell'_+$. As mentioned earlier, while the theorems
in the previous subsection are in terms of classification losses (i.e., $\ell$ is a function of $y\langle w, \Psi(x)\rangle$),
virtually identical results can be proven for regression losses (i.e., $\ell$ is a function of $\langle w, \Psi(x)\rangle - y$),
so we will give examples from both families. Working out the first two examples is straightforward.
The proofs of the other two appear in Sec. 5. The loss functions are illustrated graphically in Fig. 1.

Example 1. For the squared loss function, £{{w,x.),y) = ((w,x) — y)"^ , we have ('_f_[y^{p — l)u)) = 
2^{p~l)u. 

Example 2. For the exponential loss function, £((w,x),j/) = e^^^'^' , we have £'_^(^{p — l)u] = 
g\/(p-i)"^ 

Example 3. Consider a "smoothed" absolute loss function $\ell(\langle \mathbf{w}, \Psi(\mathbf{x})\rangle - y)$, defined as an
antiderivative of $\text{Erf}(sa)$ for some $s > 0$ (see proof for exact analytic form). Then we have that

Example 4. Consider a "smoothed" hinge loss $\ell(y\langle \mathbf{w}, \Psi(\mathbf{x})\rangle)$, defined as an antiderivative of
$(\text{Erf}(s(a-1)) - 1)/2$ for some $s > 0$ (see proof for exact analytic form). Then we have that

.V(/(,3T)^)<^^,i^(e^-^(-^)-). 

For any $s$, the loss functions in the last two examples are convex, and respectively approximate
the absolute loss $|\langle \mathbf{w}, \Psi(\mathbf{x})\rangle - y|$ and the hinge loss $\max\{0, 1 - y\langle \mathbf{w}, \Psi(\mathbf{x})\rangle\}$ arbitrarily well for large
enough $s$. Fig. 1 shows these loss functions graphically for $s = 1$. Note that $s$ need not be large
in order to get a good approximation. Also, we note that both the loss itself and its gradient are 
computationally easy to evaluate. 

Finally, we remind the reader that, as discussed in Subsection 3.2, performing an unbiased estimate
of the gradient for non-differentiable losses directly (such as the hinge loss or absolute loss) appears 
to be impossible in general. On the flip side, if one is willing to use a random number of queries 
with polynomial rather than exponential tails, then one can achieve much better sample complexity 
results, by focusing on loss functions (or approximations thereof) which are only differentiable to a 
bounded order, rather than fully analytic. This again demonstrates the tradeoff between the sample 
size and the amount of information that needs to be gathered on each training example. 
Figure 1: Absolute loss, hinge loss, and smooth approximations



4.3 One Noisy Copy is Not Enough 

The previous results might lead one to wonder whether it is really necessary to query the same 
instance more than once. In some applications this is inconvenient, and one would prefer a method 
which works when just a single noisy copy of each instance is made available. In this subsection 
we show that, unfortunately, such a method cannot be found. Specifically, we prove that under 
very mild assumptions, no method can achieve sub-linear regret when it has access to just a single 
noisy copy of each instance. On the other hand, for the case of squared loss and linear kernels,
our techniques can be adapted to work with exactly two noisy copies of each instance,¹ so without
further assumptions, the lower bound that we prove here is indeed tight. For simplicity, we prove
the result for linear kernels (i.e., where $k(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x}, \mathbf{x}'\rangle$). It is an interesting open problem to show
improved lower bounds when nonlinear kernels are used. We also note that the result crucially relies 
on the learner not knowing the noise distribution, and we leave to future work the investigation of 
what happens when this assumption is relaxed. 

Theorem 3. Let $\mathcal{W}$ be a compact convex subset of $\mathbb{R}^d$, and let $\ell(\cdot, 1) : \mathbb{R} \to \mathbb{R}$ satisfy the following:
(1) it is bounded from below; (2) it is differentiable at 0 with $\ell'(0, 1) < 0$. For any learning algorithm
which selects hypotheses from $\mathcal{W}$ and is allowed access to a single noisy copy of the instance at each
round $t$, there exists a strategy for the adversary such that the sequence $\mathbf{w}_1, \mathbf{w}_2, \ldots$ of predictors
output by the algorithm satisfies

$$\limsup_{T\to\infty} \max_{\mathbf{w}\in\mathcal{W}} \frac{1}{T} \sum_{t=1}^{T} \left( \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, y_t) - \ell(\langle \mathbf{w}, \mathbf{x}_t\rangle, y_t) \right) > 0$$

with probability 1 with respect to the randomness of the oracles.

Note that condition (1) is satisfied by virtually any loss function other than the linear loss,
while condition (2) is satisfied by most regression losses, and by all classification calibrated losses,
which include all reasonable losses for classification (see [12]). The most obvious example where the
conditions are not satisfied is when $\ell(\cdot, 1)$ is a linear function. This is not surprising, because when
$\ell(\cdot, 1)$ is linear, the learner is always robust to noise (see the discussion at Sec. 3).

The intuition of the proof is very simple: the adversary chooses beforehand whether the examples
are drawn i.i.d. from a distribution $\mathcal{D}$, and then perturbed by noise, or drawn i.i.d. from some
other distribution $\mathcal{D}'$ without adding noise. The distributions $\mathcal{D}$, $\mathcal{D}'$ and the noise are designed so
that the examples observed by the learner are distributed in the same way irrespective of which
of the two sampling strategies the adversary chooses. Therefore, it is impossible for the learner
accessing a single copy of each instance to be statistically consistent with respect to both distributions
simultaneously. As a result, the adversary can always choose a distribution on which the algorithm
will be inconsistent, leading to constant regret. The full proof is presented in Section 5.3.



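¹ In a nutshell, for squared loss and linear kernels, we just need to estimate $2(\langle \mathbf{w}_t, \mathbf{x}_t\rangle - y_t)\mathbf{x}_t$ in an
unbiased manner at each round $t$. This can be done by computing $2(\langle \mathbf{w}_t, \tilde{\mathbf{x}}_t\rangle - y_t)\tilde{\mathbf{x}}'_t$, where $\tilde{\mathbf{x}}_t, \tilde{\mathbf{x}}'_t$ are two
noisy copies of $\mathbf{x}_t$.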
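The two-copy estimate in the footnote can be checked on a toy instance. The sketch below is illustrative only and rests on assumptions that are not in the paper: the noise is taken to be an independent $\pm 1$ sign on each coordinate (zero mean, unit variance), which makes the expectation computable exactly by enumeration. The helper names are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def true_gradient(w, x, y):
    """Gradient of (<w, x> - y)^2 with respect to w."""
    c = 2.0 * (dot(w, x) - y)
    return [c * xi for xi in x]


def noise_patterns(d):
    """All sign vectors in {-1, +1}^d; each one is an equally likely noise draw."""
    return list(product((-1.0, 1.0), repeat=d))


def expected_estimate(w, x, y, two_copies):
    """Exact expectation of 2(<w, x1> - y) * x2 over the noise.

    With two_copies=True, x1 and x2 are independent noisy copies of x;
    with two_copies=False the same noisy copy is used twice.
    """
    pats = noise_patterns(len(x))
    total = [0.0] * len(x)
    count = 0
    for e1 in pats:
        x1 = [a + b for a, b in zip(x, e1)]
        c = 2.0 * (dot(w, x1) - y)
        for e2 in (pats if two_copies else [e1]):
            x2 = [a + b for a, b in zip(x, e2)]
            total = [t + c * v for t, v in zip(total, x2)]
            count += 1
    return [t / count for t in total]


if __name__ == "__main__":
    w, x, y = [0.5, 0.25], [1.0, -2.0], 1.0
    print(true_gradient(w, x, y))
    print(expected_estimate(w, x, y, two_copies=True))   # matches the true gradient
    print(expected_estimate(w, x, y, two_copies=False))  # shifted by the noise covariance
```

With a single copy, the product of the noise with itself contributes an extra term proportional to the noise covariance, which the learner cannot remove without knowing the noise distribution.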



5 Proofs 

Due to the lack of space, some of the proofs are given in the appendix.

5.1 Preliminary Result 

To prove Thm. 1 and Thm. 2 we need a theorem which basically states that if all subroutines in
Algorithm 1 behave as they should, then one can achieve an $O(\sqrt{T})$ regret bound. This is provided
in the following theorem, which is an adaptation of a standard result of online convex optimization
(see, e.g., [13]). The proof is given in Appendix D.

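Theorem 4. Assume the following conditions hold with respect to Algorithm 1:

1. For all $t$, $\tilde\Psi(\mathbf{x}_t)$ and $\tilde g_t$ are independent of each other (as random variables induced by the
randomness of Algorithm 1) as well as independent of any $\tilde\Psi(\mathbf{x}_i)$ and $\tilde g_i$ for $i < t$.

2. For all $t$, $\mathbb{E}[\tilde\Psi(\mathbf{x}_t)] = \Psi(\mathbf{x}_t)$, and there exists a constant $B_{\tilde\Psi} > 0$ such that $\mathbb{E}[\|\tilde\Psi(\mathbf{x}_t)\|^2] \le B_{\tilde\Psi}$.

3. For all $t$, $\mathbb{E}[\tilde g_t] = y_t \ell'(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle)$, and there exists a constant $B_{\tilde g} > 0$ such that $\mathbb{E}[\tilde g_t^2] \le B_{\tilde g}$.

4. For any pair of instances $\mathbf{x}, \mathbf{x}'$, Prod($\tilde\Psi(\mathbf{x})$, $\tilde\Psi(\mathbf{x}')$) $= \langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$.

Then if Algorithm 1 is run with $\eta = \sqrt{B_{\mathbf{w}} / (B_{\tilde g} B_{\tilde\Psi})}$, the following inequality holds

$$\mathbb{E}\left[ \sum_{t=1}^{T} \ell(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle) - \min_{\mathbf{w} : \|\mathbf{w}\|^2 \le B_{\mathbf{w}}} \sum_{t=1}^{T} \ell(y_t\langle \mathbf{w}, \Psi(\mathbf{x}_t)\rangle) \right] \le \sqrt{B_{\mathbf{w}} B_{\tilde g} B_{\tilde\Psi} T}$$

where the expectation is with respect to the randomness of the oracles and the algorithm throughout
its run.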

5.2 Proof of Thm. 1

In this subsection, we present the proof of Thm. 1. We first show how to implement the subroutines
of Algorithm 1 and prove the relevant results on their behavior. Then, we prove the theorem itself.
It is known that for $k(\cdot, \cdot) = Q(\langle \mathbf{x}, \mathbf{x}'\rangle)$ to be a valid kernel, it is necessary that $Q(\langle \mathbf{x}, \mathbf{x}'\rangle)$ can
be written as a Taylor expansion $\sum_{n=0}^{\infty} \beta_n (\langle \mathbf{x}, \mathbf{x}'\rangle)^n$, where $\beta_n \ge 0$ (see theorem 4.19 in [14]). This
makes these types of kernels amenable to our techniques.

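We start by constructing an explicit feature mapping $\Psi(\cdot)$ corresponding to the RKHS induced
by our kernel. For any $\mathbf{x}, \mathbf{x}'$, we have that

$$k(\mathbf{x}, \mathbf{x}') = \sum_{n=0}^{\infty} \beta_n (\langle \mathbf{x}, \mathbf{x}'\rangle)^n = \sum_{n=0}^{\infty} \beta_n \sum_{k_1=1}^{d} \cdots \sum_{k_n=1}^{d} x_{k_1} x_{k_2} \cdots x_{k_n}\, x'_{k_1} x'_{k_2} \cdots x'_{k_n}$$

$$= \sum_{n=0}^{\infty} \sum_{k_1=1}^{d} \cdots \sum_{k_n=1}^{d} \left( \sqrt{\beta_n}\, x_{k_1} x_{k_2} \cdots x_{k_n} \right) \left( \sqrt{\beta_n}\, x'_{k_1} x'_{k_2} \cdots x'_{k_n} \right).$$

This suggests the following feature representation: for any $\mathbf{x}$, $\Psi(\mathbf{x})$ returns an infinite-dimensional
vector, indexed by $n$ and $k_1, \ldots, k_n \in \{1, \ldots, d\}$, with the entry corresponding to $n, k_1, \ldots, k_n$ being
$\sqrt{\beta_n}\, x_{k_1} \cdots x_{k_n}$. The dot product between $\Psi(\mathbf{x})$ and $\Psi(\mathbf{x}')$ is similar to a standard dot product
between two vectors, and by the derivation above equals $k(\mathbf{x}, \mathbf{x}')$ as required.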

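We now use a slightly more elaborate variant of our unbiased estimate technique, to derive an
unbiased estimate of $\Psi(\mathbf{x})$. First, we sample $N$ according to $P(N = n) = (p-1)/p^{n+1}$. Then, we
query the oracle for $\mathbf{x}$ for $N$ times to get $\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(N)}$, and formally define $\tilde\Psi(\mathbf{x})$ as

$$\tilde\Psi(\mathbf{x}) = \sqrt{\beta_n}\, \frac{p^{n+1}}{p-1} \sum_{k_1=1}^{d} \cdots \sum_{k_n=1}^{d} \tilde x^{(1)}_{k_1} \cdots \tilde x^{(n)}_{k_n}\, \mathbf{e}_{n,k_1,\ldots,k_n} \qquad (2)$$

where $\mathbf{e}_{n,k_1,\ldots,k_n}$ represents the unit vector in the direction indexed by $n, k_1, \ldots, k_n$ as explained
above. Since the oracle queries are i.i.d., the expectation of this expression is

$$\sum_{n=0}^{\infty} \frac{p-1}{p^{n+1}} \sqrt{\beta_n}\, \frac{p^{n+1}}{p-1} \sum_{k_1=1}^{d} \cdots \sum_{k_n=1}^{d} \mathbb{E}\left[\tilde x^{(1)}_{k_1} \cdots \tilde x^{(n)}_{k_n}\right] \mathbf{e}_{n,k_1,\ldots,k_n} = \sum_{n=0}^{\infty} \sum_{k_1=1}^{d} \cdots \sum_{k_n=1}^{d} \sqrt{\beta_n}\, x_{k_1} \cdots x_{k_n}\, \mathbf{e}_{n,k_1,\ldots,k_n}$$

which is exactly $\Psi(\mathbf{x})$. We formalize the needed properties of $\tilde\Psi(\mathbf{x})$ in the following lemma.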



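Lemma 3. Assuming $\tilde\Psi(\mathbf{x})$ is constructed as in the discussion above, it holds that $\mathbb{E}[\tilde\Psi(\mathbf{x})] = \Psi(\mathbf{x})$
for any $\mathbf{x}$. Moreover, if the noisy samples $\tilde{\mathbf{x}}_t$ returned by the oracle $A_t$ satisfy $\mathbb{E}[\|\tilde{\mathbf{x}}_t\|^2] \le B_{\tilde{\mathbf{x}}}$, then

$$\mathbb{E}\left[\|\tilde\Psi(\mathbf{x}_t)\|^2\right] \le \frac{p}{p-1}\, Q(p B_{\tilde{\mathbf{x}}})$$

where we recall that $Q$ defines the kernel by $k(\mathbf{x}, \mathbf{x}') = Q(\langle \mathbf{x}, \mathbf{x}'\rangle)$.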

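Proof. The first part of the lemma follows from the discussion above. As to the second part, note
that by (2),

$$\mathbb{E}\left[\|\tilde\Psi(\mathbf{x}_t)\|^2\right] = \mathbb{E}\left[ \beta_n \frac{p^{2n+2}}{(p-1)^2} \sum_{k_1,\ldots,k_n=1}^{d} \left( \tilde x^{(1)}_{t,k_1} \cdots \tilde x^{(n)}_{t,k_n} \right)^2 \right] = \mathbb{E}\left[ \beta_n \frac{p^{2n+2}}{(p-1)^2} \prod_{j=1}^{n} \|\tilde{\mathbf{x}}_t^{(j)}\|^2 \right]$$

$$= \frac{p}{p-1} \sum_{n=0}^{\infty} \beta_n \left( p\, \mathbb{E}\left[\|\tilde{\mathbf{x}}_t\|^2\right] \right)^n \le \frac{p}{p-1} \sum_{n=0}^{\infty} \beta_n (p B_{\tilde{\mathbf{x}}})^n = \frac{p}{p-1}\, Q(p B_{\tilde{\mathbf{x}}})$$

where the second-to-last step used the fact that $\beta_n \ge 0$ for all $n$. $\square$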
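As a quick numerical illustration of Lemma 3, the sketch below evaluates the second moment of the estimate and the bound $\frac{p}{p-1} Q(p B_{\tilde{\mathbf{x}}})$ for a polynomial kernel with finitely many nonzero coefficients. The kernel $Q(a) = (1+a)^2$, i.e. $\beta = (1, 2, 1)$, and all numeric values are assumptions chosen purely for the example; the function names are hypothetical.

```python
def second_moment(beta, p, m2):
    """E||Psi_tilde(x)||^2 when each noisy copy satisfies E||x_tilde||^2 = m2.

    Sums P(N = n) * beta_n * p^(2n+2) / (p-1)^2 * m2^n over the nonzero beta_n.
    """
    return sum(
        (p - 1) / p ** (n + 1) * b * p ** (2 * n + 2) / (p - 1) ** 2 * m2 ** n
        for n, b in enumerate(beta)
    )


def lemma_bound(beta, p, b_x):
    """p/(p-1) * Q(p * b_x), where Q(a) = sum_n beta_n * a^n."""
    q = sum(b * (p * b_x) ** n for n, b in enumerate(beta))
    return p / (p - 1) * q


if __name__ == "__main__":
    beta = [1.0, 2.0, 1.0]                  # Q(a) = (1 + a)^2
    print(second_moment(beta, 2.0, 0.5))    # 8.0
    print(lemma_bound(beta, 2.0, 0.5))      # 8.0 (bound is tight when B equals the moment)
    print(lemma_bound(beta, 2.0, 1.0))      # 18.0
```

The bound is attained when $B_{\tilde{\mathbf{x}}}$ equals the actual second moment, and loosens monotonically as $B_{\tilde{\mathbf{x}}}$ grows because every $\beta_n$ is nonnegative.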



Of course, explicitly storing $\tilde\Psi(\mathbf{x})$ as defined above is infeasible, since the number of entries is
huge. Fortunately, this is not needed: we just need to store $\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(N)}$. The representation above
is used implicitly when we calculate dot products between $\tilde\Psi(\mathbf{x})$ and other elements in the RKHS,
via the subroutine Prod. We note that while $N$ is a random quantity which might be unbounded,
its distribution decays exponentially fast, so the number of vectors to store is essentially bounded.
After the discussion above, the pseudocode for Map_Estimate below should be self-explanatory.



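Subroutine 2 Map_Estimate($A_t$, $p$)

Sample nonnegative integer $N$ according to $P(N = n) = (p-1)/p^{n+1}$
Query $A_t$ for $N$ times to get $\tilde{\mathbf{x}}_t^{(1)}, \ldots, \tilde{\mathbf{x}}_t^{(N)}$
Return $\tilde{\mathbf{x}}_t^{(1)}, \ldots, \tilde{\mathbf{x}}_t^{(N)}$ as $\tilde\Psi(\mathbf{x}_t)$.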



We now turn to the subroutine Prod, which given two elements in the RKHS, returns their dot
product. This subroutine comes in two flavors: either as a procedure defined over $\tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')$ and
returning $\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$ (Subroutine 3); or as a procedure defined over $\tilde\Psi(\mathbf{x}), \mathbf{x}'$ (Subroutine 4, where
the second element is an explicitly given vector) and returning $\langle \tilde\Psi(\mathbf{x}), \Psi(\mathbf{x}')\rangle$. This second variant
of Prod is needed when we wish to apply the learned predictor on a new given instance $\mathbf{x}'$.



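Subroutine 3 Prod($\tilde\Psi(\mathbf{x})$, $\tilde\Psi(\mathbf{x}')$)

Let $n, \tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(n)}$ be the index and vectors comprising $\tilde\Psi(\mathbf{x})$
Let $n', \tilde{\mathbf{x}}'^{(1)}, \ldots, \tilde{\mathbf{x}}'^{(n')}$ be the index and vectors comprising $\tilde\Psi(\mathbf{x}')$
If $n \ne n'$ return 0, else return $\beta_n \frac{p^{2n+2}}{(p-1)^2} \prod_{j=1}^{n} \langle \tilde{\mathbf{x}}^{(j)}, \tilde{\mathbf{x}}'^{(j)}\rangle$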



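Lemma 4. Prod($\tilde\Psi(\mathbf{x})$, $\tilde\Psi(\mathbf{x}')$) returns $\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$.

Proof. Using the formal representation of $\tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')$ in (2), we have that $\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$ is 0 whenever
$n \ne n'$ (because then these two elements are composed of different unit vectors with respect to
an orthogonal basis). Otherwise, we have that

$$\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle = \beta_n \frac{p^{2n+2}}{(p-1)^2} \sum_{k_1,\ldots,k_n=1}^{d} \tilde x^{(1)}_{k_1} \cdots \tilde x^{(n)}_{k_n}\, \tilde x'^{(1)}_{k_1} \cdots \tilde x'^{(n)}_{k_n}$$

$$= \beta_n \frac{p^{2n+2}}{(p-1)^2} \left( \sum_{k_1=1}^{d} \tilde x^{(1)}_{k_1} \tilde x'^{(1)}_{k_1} \right) \cdots \left( \sum_{k_n=1}^{d} \tilde x^{(n)}_{k_n} \tilde x'^{(n)}_{k_n} \right) = \beta_n \frac{p^{2n+2}}{(p-1)^2} \prod_{j=1}^{n} \langle \tilde{\mathbf{x}}^{(j)}, \tilde{\mathbf{x}}'^{(j)}\rangle$$

which is exactly what the algorithm returns, hence the lemma follows. $\square$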
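To make the implicit representation concrete, the following Python sketch compares the implicit inner product of Subroutine 3 with a brute-force dot product over the explicit coordinates of representation (2). It is an illustration under assumptions: an estimate is modelled as a plain list of noisy copies (its length plays the role of the sampled index $n$), the kernel coefficients `beta` are those of $Q(a) = (1+a)^2$, and the function names are hypothetical.

```python
import math
from itertools import product


def prod_estimates(beta, p, est_a, est_b):
    """Implicit inner product of two estimates, each a list of noisy copies."""
    if len(est_a) != len(est_b):
        return 0.0  # different indices n: the supports are orthogonal
    n = len(est_a)
    value = beta[n] * p ** (2 * n + 2) / (p - 1) ** 2
    for u, v in zip(est_a, est_b):
        value *= sum(a * b for a, b in zip(u, v))
    return value


def explicit_features(beta, p, est):
    """Explicit coordinates of representation (2), keyed by (n, k_1, ..., k_n)."""
    n = len(est)
    d = len(est[0]) if n else 0
    scale = math.sqrt(beta[n]) * p ** (n + 1) / (p - 1)
    feats = {}
    for ks in product(range(d), repeat=n):
        coord = scale
        for j, k in enumerate(ks):
            coord *= est[j][k]
        feats[(n,) + ks] = coord
    return feats


def explicit_dot(fa, fb):
    """Dot product of two sparse coordinate dictionaries."""
    return sum(v * fb[key] for key, v in fa.items() if key in fb)


if __name__ == "__main__":
    beta, p = [1.0, 2.0, 1.0], 2.0
    a = [[1.0, 2.0], [0.5, -1.0]]
    b = [[3.0, 0.0], [1.0, 1.0]]
    print(prod_estimates(beta, p, a, b))
    print(explicit_dot(explicit_features(beta, p, a), explicit_features(beta, p, b)))
```

The explicit version enumerates $d^n$ coordinates while the implicit one needs only $n$ inner products in $\mathbb{R}^d$, which is the point of storing the noisy copies rather than the feature vector.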



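The pseudocode for calculating the dot product $\langle \tilde\Psi(\mathbf{x}), \Psi(\mathbf{x}')\rangle$ (where $\mathbf{x}'$ is known) is very similar,
and the proof is essentially the same.

Subroutine 4 Prod($\tilde\Psi(\mathbf{x})$, $\mathbf{x}'$)

Let $n, \tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(n)}$ be the index and vectors comprising $\tilde\Psi(\mathbf{x})$
Return $\beta_n \frac{p^{n+1}}{p-1} \prod_{j=1}^{n} \langle \tilde{\mathbf{x}}^{(j)}, \mathbf{x}'\rangle$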

We are now ready to prove Thm. 1. First, regarding the expected number of queries, notice
that to run Algorithm 1, we invoke Map_Estimate and Grad_Length_Estimate once at round $t$.
Map_Estimate uses a random number $B$ of queries distributed as $P(B = n) = (p-1)/p^{n+1}$, and
Grad_Length_Estimate invokes Map_Estimate a random number $C$ of times, distributed as $P(C = n) = (p-1)/p^{n+1}$.
The total number of queries is therefore $\sum_{j=1}^{C+1} B_j$, where $B_j$ for all $j$ are i.i.d.
copies of $B$. The expected value of this expression, using a standard result on the expected value
of a sum of a random number of independent random variables, is equal to $(1 + \mathbb{E}[C])\,\mathbb{E}[B_j]$, or

$$\left(1 + \frac{1}{p-1}\right) \frac{1}{p-1} = \frac{p}{(p-1)^2}.$$
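The arithmetic behind the expected query count can be reproduced in a few lines. This is a small sanity-check sketch rather than part of the algorithm: `mean_geometric` sums the series for $\mathbb{E}[N]$ directly (truncated at an arbitrary `n_max`, an assumption that is harmless for the values of $p$ used here), and `expected_queries` evaluates $(1 + \mathbb{E}[C])\,\mathbb{E}[B]$.

```python
def mean_geometric(p, n_max=400):
    """E[N] for P(N = n) = (p - 1) / p**(n + 1), by truncated direct summation."""
    return sum(n * (p - 1) / p ** (n + 1) for n in range(n_max))


def expected_queries(p):
    """(1 + E[C]) * E[B] with E[B] = E[C] = 1 / (p - 1)."""
    m = 1.0 / (p - 1)
    return (1 + m) * m


if __name__ == "__main__":
    for p in (1.5, 2.0, 3.0):
        print(p, mean_geometric(p), 1 / (p - 1), expected_queries(p), p / (p - 1) ** 2)
```

Smaller $p$ means heavier tails for $N$ and therefore more queries on average, while larger $p$ reduces the query count at the price of larger constants in the variance bounds.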

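In terms of running time, we note that the expected running time of Prod is $O\!\left(1 + \frac{d}{p-1}\right)$,
this because it performs $N$ multiplications of inner products, each one with running time $O(d)$,
and $\mathbb{E}[N] = \frac{1}{p-1}$. The expected running time of Map_Estimate is $O\!\left(1 + \frac{1}{p-1}\right)$. The expected
running time of Grad_Length_Estimate is $O\!\left(1 + \frac{1}{p-1}\left(1 + \frac{1}{p-1}\right) + T\left(1 + \frac{d}{p-1}\right)\right)$, which can be written
as $O\!\left(1 + \frac{p}{(p-1)^2} + T\left(1 + \frac{d}{p-1}\right)\right)$. Since Algorithm 1 at each of $T$ rounds calls Map_Estimate once,
Grad_Length_Estimate once, Prod for $O(T^2)$ times, and performs $O(1)$ other operations, we get
that the overall runtime is

$$O\!\left( T \left( 1 + \frac{1}{p-1} + \frac{p}{(p-1)^2} + T\left(1 + \frac{d}{p-1}\right) + T^2\left(1 + \frac{d}{p-1}\right) \right) \right).$$

Since $\frac{1}{p-1} \le \frac{p}{(p-1)^2}$, we can upper bound this by

$$O\!\left( T \left( 1 + \frac{p}{(p-1)^2} + T^2\left(1 + \frac{d}{p-1}\right) \right) \right) = O\!\left( T^3 \left( 1 + \frac{dp}{(p-1)^2} \right) \right).$$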

The regret bound in the theorem follows from Thm. 4, with the expressions for constants following
from Lemma 2, Lemma 3, and Lemma 4.

5.3 Proof Sketch of Thm. 3

To prove the theorem, we use a more general result which leads to non-vanishing regret, and then
show that under the assumptions of Thm. 3, the result holds. The proof of the result is given in
Appendix F.

Theorem 5. Let $\mathcal{W}$ be a compact convex subset of $\mathbb{R}^d$ and pick any learning algorithm which selects
hypotheses from $\mathcal{W}$ and is allowed access to a single noisy copy of the instance at each round $t$. If
there exists a distribution over a compact subset of $\mathbb{R}^d$ such that

$$\arg\min_{\mathbf{w}\in\mathcal{W}} \mathbb{E}\left[\ell(\langle \mathbf{w}, \mathbf{x}\rangle, 1)\right] \quad\text{and}\quad \arg\min_{\mathbf{w}\in\mathcal{W}} \ell(\langle \mathbf{w}, \mathbb{E}[\mathbf{x}]\rangle, 1) \qquad (3)$$

are disjoint, then there exists a strategy for the adversary such that the sequence $\mathbf{w}_1, \mathbf{w}_2, \ldots \in \mathcal{W}$ of
predictors output by the algorithm satisfies

$$\limsup_{T\to\infty} \max_{\mathbf{w}\in\mathcal{W}} \frac{1}{T} \sum_{t=1}^{T} \left( \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, y_t) - \ell(\langle \mathbf{w}, \mathbf{x}_t\rangle, y_t) \right) > 0$$

with probability 1 with respect to the randomness of the oracles.

Another way to phrase this theorem is that the regret cannot vanish, if given examples sampled
i.i.d. from a distribution, the learning problem is more complicated than just finding the mean of the
data. Indeed, the adversary's strategy we choose later on is simply drawing and presenting examples
from such a distribution. Below, we sketch how we use Thm. 5 in order to prove Thm. 3. A full
proof is provided in Appendix E.

We construct a very simple one-dimensional distribution, which satisfies the conditions of Thm. 5:
it is simply the uniform distribution on $\{3\mathbf{x}, -\mathbf{x}\}$, where $\mathbf{x}$ is the vector $(1, 0, \ldots, 0)$. Thus, it is
enough to show that

$$\arg\min_{w : |w|^2 \le B_{\mathbf{w}}} \ell(3w, 1) + \ell(-w, 1) \quad\text{and}\quad \arg\min_{w : |w|^2 \le B_{\mathbf{w}}} \ell(w, 1) \qquad (4)$$

are disjoint, for some appropriately chosen $B_{\mathbf{w}}$. Assuming the contrary, then under the assumptions
on $\ell$, we show that the first set in Eq. (4) is inside a bounded ball around the origin, in a way
which is independent of $B_{\mathbf{w}}$, no matter how large it is. Thus, if we pick $B_{\mathbf{w}}$ to be large enough,
and assume that the two sets in Eq. (4) are not disjoint, then there must be some $w$ such that both
$\ell(3w, 1) + \ell(-w, 1)$ and $\ell(w, 1)$ have a subgradient of zero at $w$. However, this can be shown to
contradict the assumptions on $\ell$, leading to the desired result.
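The construction can be made tangible with a worked instance. The sketch below is an illustration under an added assumption: it takes the squared loss $\ell(a, 1) = (a - 1)^2$ (which is bounded from below and has $\ell'(0, 1) = -2 < 0$), restricts attention to the first coordinate, and uses exact rational arithmetic. It shows that the learner observes the same distribution under both adversary strategies, while the best predictor differs ($w = 1/5$ versus $w = 1$). The function names are hypothetical.

```python
from fractions import Fraction

# D: uniform distribution on {3, -1} (first coordinate of {3x, -x} with x = (1, 0, ..., 0)).
D = {Fraction(3): Fraction(1, 2), Fraction(-1): Fraction(1, 2)}


def observed_distribution(strategy):
    """Distribution (value -> probability) of what the learner sees."""
    if strategy == "clean_from_D":
        return dict(D)  # examples drawn from D, no noise added
    if strategy == "noisy_mean":
        mean = sum(v * pr for v, pr in D.items())  # E[x] = 1
        out = {}
        for z, pr in D.items():
            seen = mean + (z - mean)  # true instance E[x] plus zero-mean noise Z - E[x]
            out[seen] = out.get(seen, 0) + pr
        return out
    raise ValueError(strategy)


def best_w_squared_loss(true_dist):
    """Minimizer over w of E[(w * x - 1)^2], namely E[x] / E[x^2]."""
    ex = sum(x * pr for x, pr in true_dist.items())
    ex2 = sum(x * x * pr for x, pr in true_dist.items())
    return ex / ex2


if __name__ == "__main__":
    print(observed_distribution("clean_from_D") == observed_distribution("noisy_mean"))
    print(best_w_squared_loss(D))                          # 1/5
    print(best_w_squared_loss({Fraction(1): Fraction(1)})) # 1
```

Because the two optimal predictors differ while the observations are identically distributed, any single sequence of predictions must be suboptimal by a constant margin under at least one of the two strategies.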

6 Future Work 

There are several interesting research directions worth pursuing in the noisy learning framework
introduced here. For instance, one could do away with unbiasedness, which could lead to the design of
estimators that are applicable to more types of loss functions, for which unbiased estimators may
not even exist. Also, it would be interesting to show how additional information about the
noise distribution can be used to design improved estimates, possibly in association with specific
losses or kernels. Another open question is whether our lower bound (Thm. 3) can be improved
when nonlinear kernels are used.

A Alternative Notions of Regret

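In the online setting, one may consider notions of regret other than (1). One choice is

$$\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \Psi(\tilde{\mathbf{x}}_t)\rangle, y_t) - \min_{\mathbf{w}\in\mathcal{W}} \sum_{t=1}^{T} \ell(\langle \mathbf{w}, \Psi(\tilde{\mathbf{x}}_t)\rangle, y_t)$$

but this is too easy, as it reduces to standard online learning with respect to examples which happen
to be noisy. Another kind of regret we may want to minimize is

$$\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \Psi(\tilde{\mathbf{x}}_t)\rangle, y_t) - \min_{\mathbf{w}\in\mathcal{W}} \sum_{t=1}^{T} \ell(\langle \mathbf{w}, \Psi(\mathbf{x}_t)\rangle, y_t). \qquad (5)$$

This is the kind of regret which is relevant for actually predicting the values $y_t$ well based on the
noisy instances. Unfortunately, in general this is too much to hope for. To see why, assume we deal
with a linear kernel (so that $\Psi(\mathbf{x}) = \mathbf{x}$), and assume $\ell(\langle \mathbf{w}, \mathbf{x}\rangle, y) = (\langle \mathbf{w}, \mathbf{x}\rangle - y)^2$. Now, suppose that
the adversary picks some $\mathbf{w}^* \ne 0$ in $\mathcal{W}$, which might even be known to the learner, and at each
round $t$ provides the example $(\mathbf{w}^*/\|\mathbf{w}^*\|^2, 1)$. It is easy to verify that Eq. (5) in this case equals

$$\sum_{t=1}^{T} \left( \langle \mathbf{w}_t, \tilde{\mathbf{x}}_t\rangle - 1 \right)^2.$$

Recall that the learner chooses $\mathbf{w}_t$ before $\tilde{\mathbf{x}}_t$ is revealed. Therefore, if the noise which leads to $\tilde{\mathbf{x}}_t$
has positive variance, it will generally be impossible for the learner to choose $\mathbf{w}_t$ such that $\langle \mathbf{w}_t, \tilde{\mathbf{x}}_t\rangle$
is arbitrarily close to 1. Therefore, the equation above cannot grow sub-linearly with $T$.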
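A scalar version of this argument can be worked out exactly. The sketch below is illustrative and assumes a one-dimensional instance $x = 1$ with zero-mean noise of variance `noise_var` that is independent of the learner's choice; under that assumption the expected per-round loss of any fixed $w$ is $(wx - 1)^2 + \text{noise\_var} \cdot w^2$, whose minimum is strictly positive. The function names are hypothetical.

```python
from fractions import Fraction


def expected_sq_error(w, x, noise_var):
    """E[(w * x_tilde - 1)^2] for x_tilde = x + noise, zero-mean noise of variance noise_var."""
    return (w * x - 1) ** 2 + noise_var * w ** 2


def best_fixed_w(x, noise_var):
    """Minimizer of expected_sq_error over w (set the derivative to zero)."""
    return x / (x * x + noise_var)


if __name__ == "__main__":
    x, v = Fraction(1), Fraction(1, 4)
    w = best_fixed_w(x, v)
    print(w, expected_sq_error(w, x, v))  # 4/5 and 1/5
```

With $x = 1$ the minimum equals $v/(1+v)$, so each round contributes a constant expected loss and the sum grows linearly in $T$ whenever the variance $v$ is positive.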

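B Proof of Thm. 2

The analysis in this subsection is similar to the one of Subsection 5.2, focusing on Gaussian kernels.
Namely, we assume here that the kernel $k(\mathbf{x}, \mathbf{x}')$ is equal to $e^{-\|\mathbf{x} - \mathbf{x}'\|^2/\sigma^2}$ for some $\sigma^2 > 0$.

We start by constructing an explicit feature mapping $\Psi(\cdot)$ corresponding to the RKHS induced
by our kernel. For any $\mathbf{x}, \mathbf{x}'$, we have that

$$k(\mathbf{x}, \mathbf{x}') = e^{-\|\mathbf{x} - \mathbf{x}'\|^2/\sigma^2} = e^{-\|\mathbf{x}\|^2/\sigma^2}\, e^{-\|\mathbf{x}'\|^2/\sigma^2}\, e^{2\langle \mathbf{x}, \mathbf{x}'\rangle/\sigma^2}$$

$$= e^{-\|\mathbf{x}\|^2/\sigma^2}\, e^{-\|\mathbf{x}'\|^2/\sigma^2} \sum_{n=0}^{\infty} \frac{(2\langle \mathbf{x}, \mathbf{x}'\rangle)^n}{\sigma^{2n}\, n!}$$

$$= e^{-\|\mathbf{x}\|^2/\sigma^2}\, e^{-\|\mathbf{x}'\|^2/\sigma^2} \sum_{n=0}^{\infty} \sum_{k_1,\ldots,k_n=1}^{d} \frac{(2/\sigma^2)^n}{n!}\, x_{k_1} \cdots x_{k_n}\, x'_{k_1} \cdots x'_{k_n}.$$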

This suggests the following feature representation: for any $\mathbf{x}$, $\Psi(\mathbf{x})$ returns an infinite-dimensional
vector, indexed by $n$ and $k_1, \ldots, k_n \in \{1, \ldots, d\}$, with the entry corresponding to $n, k_1, \ldots, k_n$ being
$e^{-\|\mathbf{x}\|^2/\sigma^2} \sqrt{\frac{(2/\sigma^2)^n}{n!}}\, x_{k_1} \cdots x_{k_n}$. The dot product between $\Psi(\mathbf{x})$ and $\Psi(\mathbf{x}')$ is similar to a standard dot
product between two vectors, and by the derivation above equals $k(\mathbf{x}, \mathbf{x}')$ as required.

The idea of deriving an unbiased estimate of $\Psi(\mathbf{x})$ is the following: first, we sample $N_1, N_2$
independently according to $P(N_1 = n_1) = (p-1)/p^{n_1+1}$ and $P(N_2 = n_2) = (p-1)/p^{n_2+1}$. Then, we query the oracle
for $\mathbf{x}$ for $2N_1 + N_2$ times to get $\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(2N_1+N_2)}$, and formally define $\tilde\Psi(\mathbf{x})$ as

^^ ' \3 = l J \/ci,...,fc„,=l 

.(6) 

where $\mathbf{e}_{N_2,k_1,\ldots,k_{N_2}}$ represents the unit vector in the direction indexed by $N_2, k_1, \ldots, k_{N_2}$ as explained
above. Since the oracle calls are i.i.d., it is not hard to verify that the expectation of the expression



above is 






(-||x||V(72)"i\ / ^ (2/ct2)«2 



d 



^"1=0 ^' / yn2=0 ^' fei,...,fe„2=l 

"2=0 fci,. ..,/£„,=! 

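which is exactly $\Psi(\mathbf{x})$ as defined above.

To actually store $\tilde\Psi(\mathbf{x})$ in memory, we simply keep $N_1, N_2$ and $\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(2N_1+N_2)}$. The representation
above is used implicitly when we calculate dot products between $\tilde\Psi(\mathbf{x})$ and other elements in the
RKHS, via the subroutine Prod. We formalize the needed properties of $\tilde\Psi(\mathbf{x})$ in the following lemma.

Lemma 5. Assuming the construction of $\tilde\Psi(\mathbf{x})$ as in the discussion above, it holds that $\mathbb{E}_t[\tilde\Psi(\mathbf{x})] = \Psi(\mathbf{x})$
for all $\mathbf{x}$. Moreover, if the noisy sample $\tilde{\mathbf{x}}_t$ returned by the oracle $A_t$ satisfies $\mathbb{E}[\|\tilde{\mathbf{x}}_t\|^2] \le B_{\tilde{\mathbf{x}}}$,
then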
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l^(xt) 



Proof. The first part of the lemma follows from the discussion above. As to the second part, note
that by (6), we have that

The expectation of this expression over $N_1, N_2$ is equal to

After the discussion above, the pseudocode for Map_Estimate below should be self-explanatory.

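Subroutine 5 Map_Estimate($A_t$, $p$)

Sample $N_1$ according to $P(N_1 = n_1) = (p-1)/p^{n_1+1}$
Sample $N_2$ according to $P(N_2 = n_2) = (p-1)/p^{n_2+1}$
Query $A_t$ for $2N_1 + N_2$ times to get $\tilde{\mathbf{x}}_t^{(1)}, \ldots, \tilde{\mathbf{x}}_t^{(2N_1+N_2)}$
Return $\tilde{\mathbf{x}}_t^{(1)}, \ldots, \tilde{\mathbf{x}}_t^{(2N_1+N_2)}$ as $\tilde\Psi(\mathbf{x}_t)$.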



We now turn to the subroutine Prod, which given two elements in the RKHS, returns their dot
product. This subroutine comes in two flavors: either as a procedure defined over $\tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')$ and
returning $\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$ (Subroutine 6); or as a procedure defined over $\tilde\Psi(\mathbf{x}), \mathbf{x}'$ (Subroutine 7, where
the second element is an explicitly given vector) and returning $\langle \tilde\Psi(\mathbf{x}), \Psi(\mathbf{x}')\rangle$. This second variant
of Prod is needed when we wish to apply the hypothesis on a new (known) instance $\mathbf{x}'$.

Subroutine 6 Prod($\tilde\Psi(\mathbf{x})$, $\tilde\Psi(\mathbf{x}')$)



Let x("\ . . . ^x(2ni+"2) ^g ^jjg vectors comprising ^(x) 

tors comprising ^(x' 

C_X)"i+"ir)"i+"i+2".2+422n2 



T + ~S^'> ~,(2"i+"2) , ,, , . . ,T,/ M 

Let x' , . . . , x' be the vectors comprismg W(x ) 



If 712 7^ '^2 return 0, else return 



ni!n;!(n2!)2cr2("i+"i+2"2)(p-l)4 

X (n;=i(x('^-^\x(2j'))) (n;ii(x'(^^'''\i'('^')>) (n;;i(x(2"^+i),x'(2"i+^)) 

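The proof of the following lemma is a straightforward algebraic exercise, similar to the proof of
Lemma 4.

Lemma 6. Prod($\tilde\Psi(\mathbf{x})$, $\tilde\Psi(\mathbf{x}')$) returns $\langle \tilde\Psi(\mathbf{x}), \tilde\Psi(\mathbf{x}')\rangle$.

The pseudocode for calculating the dot product $\langle \tilde\Psi(\mathbf{x}), \Psi(\mathbf{x}')\rangle$ (where $\mathbf{x}'$ is known) is very similar,
and the proof is essentially the same.

Subroutine 7 Prod($\tilde\Psi(\mathbf{x})$, $\mathbf{x}'$)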

Let x^^\ . . . ^x'^2"i+"2) be the vectors comprising ^(x) 



Return 

rii!(n2!)2cr2("i+2"2)(p_i)2' 

We are now ready to prove Thm. 2. First, regarding the expected number of queries, notice
that to run Algorithm 1, we invoke Map_Estimate and Grad_Length_Estimate once at round $t$.
Map_Estimate uses a random number $2B_1 + B_2$ of queries, where $B_1, B_2$ are independent and
distributed as $P(B_1 = n) = P(B_2 = n) = (p-1)/p^{n+1}$. Grad_Length_Estimate invokes Map_Estimate
a random number $C$ of times, where $P(C = n) = (p-1)/p^{n+1}$. The total number of queries is
therefore $\sum_{j=1}^{C+1} (2B_{j,1} + B_{j,2})$, where $B_{j,1}, B_{j,2}$ are i.i.d. copies of $B_1, B_2$ respectively. The expected
value of this expression, using a standard result on the expected value of a sum of a random number
of random variables, is equal to $(1 + \mathbb{E}[C])(2\mathbb{E}[B_{j,1}] + \mathbb{E}[B_{j,2}])$, or $\left(1 + \frac{1}{p-1}\right)\frac{3}{p-1} = \frac{3p}{(p-1)^2}$.

In terms of running time, the analysis is completely identical to the one performed in the proof
of Thm. 1, and the expected running time is the same up to constants.

The regret bound in the theorem follows from Thm. 4, with the expressions for constants following
from Lemma 2, Lemma 5, and Lemma 6.

C Proof of Examples 3 and 4

Examples 3 and 4 use the error function $\text{Erf}(a)$ in order to construct smooth approximations of the
hinge loss and the absolute loss (see Fig. 1). The error function is useful for our purposes, since it
is analytic in all of $\mathbb{R}$, and smoothly interpolates between $-1$ for $a \ll 0$ and $1$ for $a \gg 0$. Thus, it
can be used to approximate the derivative of losses which are piecewise linear, such as the hinge loss
$\ell(a) = \max\{0, 1 - a\}$ and the absolute loss $\ell(a) = |a|$.

To approximate the absolute loss, we use the antiderivative of $\text{Erf}(sa)$. This function represents
a smooth upper bound on the absolute loss, which becomes tighter as $s$ increases. It can be verified
that the antiderivative (with the constant free parameter fixed so the function has the desired
behavior) is

$$\ell(a) = a\, \text{Erf}(sa) + \frac{e^{-s^2 a^2}}{s\sqrt{\pi}}.$$

While this loss function may seem to have a slightly complex form, we note that our algorithm only
needs to calculate the derivative of this loss function at various points (namely $\text{Erf}(sa)$ for various
values of $a$), which can be easily done.

By a Taylor expansion of the error function, we have that

$$\text{Erf}(sa) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n (sa)^{2n+1}}{n!\,(2n+1)}.$$

Therefore, $\ell'_+(a)$ in this case is at most

We now turn to deal with Example 4. This time, we use the antiderivative of $(\text{Erf}(s(a-1)) - 1)/2$.
This function smoothly interpolates between $-1$ for $a \ll 1$ and $0$ for $a \gg 1$. Therefore, its
antiderivative with respect to $a$ represents a smooth upper bound on the hinge loss, which becomes
tighter as $s$ increases. It can be verified that the antiderivative (with the constant free parameter
fixed so the function has the desired behavior) is

$$\ell(a) = \frac{(a-1)\left(\text{Erf}(s(a-1)) - 1\right)}{2} + \frac{e^{-s^2(a-1)^2}}{2\sqrt{\pi}\, s}.$$
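Both smoothed losses are cheap to evaluate with a standard error-function routine. The following Python sketch is an illustration, not the authors' code: it assumes the closed forms $a\,\text{Erf}(sa) + e^{-s^2a^2}/(s\sqrt{\pi})$ and $(a-1)(\text{Erf}(s(a-1)) - 1)/2 + e^{-s^2(a-1)^2}/(2\sqrt{\pi}s)$, and checks numerically that their derivatives are $\text{Erf}(sa)$ and $(\text{Erf}(s(a-1)) - 1)/2$. The function names and the finite-difference step are hypothetical choices.

```python
import math


def smooth_abs(a, s=1.0):
    """Antiderivative of erf(s*a), normalised so that it approaches |a| for large |a|."""
    return a * math.erf(s * a) + math.exp(-(s * a) ** 2) / (s * math.sqrt(math.pi))


def smooth_hinge(a, s=1.0):
    """Antiderivative of (erf(s*(a-1)) - 1)/2; approaches max(0, 1 - a) away from a = 1."""
    u = a - 1.0
    return (u * (math.erf(s * u) - 1.0) / 2.0
            + math.exp(-(s * u) ** 2) / (2.0 * s * math.sqrt(math.pi)))


def numeric_derivative(f, a, h=1e-5):
    """Central finite difference, used only to sanity-check the closed forms."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


if __name__ == "__main__":
    for a in (-2.0, -0.5, 0.0, 0.3, 1.0, 2.5):
        print(a, smooth_abs(a), abs(a), smooth_hinge(a), max(0.0, 1.0 - a))
```

Writing $u = a - 1$, the smoothed hinge equals half the smoothed absolute loss at $u$ minus $u/2$, which mirrors the identity $\max\{0, -u\} = (|u| - u)/2$ and explains why both functions upper bound their piecewise-linear counterparts.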

By a Taylor expansion of the error function, we have that

$$\frac{\text{Erf}(s(a-1)) - 1}{2} = -\frac{1}{2} + \frac{1}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n (s(a-1))^{2n+1}}{n!\,(2n+1)}.$$

Thus, $\ell'_+(a)$ in this case can be upper bounded by

D Proof of Thm. 4

Our algorithm corresponds to Zinkevich's algorithm [17] in a finite horizon setting, where we assume
the sequence of examples is $\tilde g_1\tilde\Psi(\mathbf{x}_1), \ldots, \tilde g_T\tilde\Psi(\mathbf{x}_T)$, the cost function is linear, and the learning rate
at round $t$ is $\eta/\sqrt{T}$. By a straightforward adaptation of the standard regret bound for that algorithm
(see [17]), we have that for any $\mathbf{w}$ such that $\|\mathbf{w}\|^2 \le B_{\mathbf{w}}$,



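$$\sum_{t=1}^{T} \langle \mathbf{w}_t, \tilde g_t\tilde\Psi(\mathbf{x}_t)\rangle - \sum_{t=1}^{T} \langle \mathbf{w}, \tilde g_t\tilde\Psi(\mathbf{x}_t)\rangle \le$$

We now take expectation of both sides in the inequality above. The expectation of the right-hand
side is simply

As to the left-hand side, note that

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t, \tilde g_t\tilde\Psi(\mathbf{x}_t)\rangle\right] = \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}_t\left[\langle \mathbf{w}_t, \tilde g_t\tilde\Psi(\mathbf{x}_t)\rangle\right]\right] = \mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t, y_t \ell'(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle)\Psi(\mathbf{x}_t)\rangle\right].$$

Also,

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}, \tilde g_t\tilde\Psi(\mathbf{x}_t)\rangle\right] = \mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}, y_t \ell'(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle)\Psi(\mathbf{x}_t)\rangle\right].$$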



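Plugging in these expectations and choosing $\eta = \sqrt{B_{\mathbf{w}} / (B_{\tilde g} B_{\tilde\Psi})}$, we get that for any $\mathbf{w}$ such that $\|\mathbf{w}\|^2 \le B_{\mathbf{w}}$,

$$\mathbb{E}\left[\sum_{t=1}^{T} \langle \mathbf{w}_t, y_t \ell'(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle)\Psi(\mathbf{x}_t)\rangle - \sum_{t=1}^{T} \langle \mathbf{w}, y_t \ell'(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle)\Psi(\mathbf{x}_t)\rangle\right] \le \sqrt{B_{\mathbf{w}} B_{\tilde g} B_{\tilde\Psi} T}.$$

To get the theorem, we note that by convexity of $\ell$, the left-hand side above can be lower bounded
by

$$\mathbb{E}\left[\sum_{t=1}^{T} \ell(y_t\langle \mathbf{w}_t, \Psi(\mathbf{x}_t)\rangle) - \sum_{t=1}^{T} \ell(y_t\langle \mathbf{w}, \Psi(\mathbf{x}_t)\rangle)\right].$$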
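The online gradient descent scheme that this proof builds on can be sketched in finite dimensions. The code below is a generic illustration, not Algorithm 1 itself: it assumes explicit cost vectors $g_t$ in place of $\tilde g_t\tilde\Psi(\mathbf{x}_t)$, starts from $\mathbf{w}_1 = 0$, uses the step size $\eta/\sqrt{T}$, and projects onto the ball $\{\mathbf{w} : \|\mathbf{w}\|^2 \le B_{\mathbf{w}}\}$. It then compares the regret with the standard bound $\frac{B_{\mathbf{w}}}{2\,\text{step}} + \frac{\text{step}}{2}\sum_t \|g_t\|^2$ for projected gradient descent on linear costs. All names are hypothetical.

```python
import math


def project(w, radius):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in w))
    if norm <= radius:
        return list(w)
    return [c * radius / norm for c in w]


def ogd_linear(costs, b_w, eta):
    """Projected online gradient descent on linear costs <w, g_t>, starting at 0."""
    step = eta / math.sqrt(len(costs))
    radius = math.sqrt(b_w)
    w = [0.0] * len(costs[0])
    iterates = []
    for g in costs:
        iterates.append(list(w))  # w_t is fixed before g_t is used
        w = project([wi - step * gi for wi, gi in zip(w, g)], radius)
    return iterates


def regret_vs_best(costs, iterates, b_w):
    """Cumulative cost of the iterates minus that of the best fixed w in the ball."""
    total = sum(sum(wi * gi for wi, gi in zip(w, g)) for w, g in zip(iterates, costs))
    s = [sum(g[k] for g in costs) for k in range(len(costs[0]))]
    best = -math.sqrt(b_w) * math.sqrt(sum(c * c for c in s))
    return total - best


if __name__ == "__main__":
    costs = [[1.0 if t % 2 == 0 else -1.0, 0.5] for t in range(50)]
    its = ogd_linear(costs, b_w=4.0, eta=1.0)
    step = 1.0 / math.sqrt(len(costs))
    bound = 4.0 / (2 * step) + step / 2 * sum(sum(c * c for c in g) for g in costs)
    print(regret_vs_best(costs, its, 4.0), bound)
```

Balancing the two terms of the bound in the step size is what produces the $\sqrt{T}$ rate in the theorem.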

E Proof of Theorem 3

Fix a large enough $B_{\mathbf{w}} > 1$ to be specified later. Let $\mathbf{x} = (1, 0, \ldots, 0)$ and let $\mathcal{D}$ be the uniform
distribution on $\{3\mathbf{x}, -\mathbf{x}\}$. To prove the result we then just need to show that

$$\arg\min_{w : |w|^2 \le B_{\mathbf{w}}} \ell(3w, 1) + \ell(-w, 1) \quad\text{and}\quad \arg\min_{w : |w|^2 \le B_{\mathbf{w}}} \ell(w, 1) \qquad (7)$$

are disjoint, for some appropriately chosen $B_{\mathbf{w}}$.

First, we show that the first set above is a subset of $\{w : |w|^2 \le R\}$ for some fixed $R$ which does
not depend on $B_{\mathbf{w}}$. We do a case-by-case analysis, depending on what $\ell(\cdot, 1)$ looks like.

1. $\ell(\cdot, 1)$ monotonically increases in $\mathbb{R}$. Impossible by assumption (2).

2. $\ell(\cdot, 1)$ monotonically decreases in $\mathbb{R}$. First, recall that since $\ell(\cdot, 1)$ is convex, it is differentiable
almost everywhere, and its derivative is monotonically increasing. Now, since $\ell(\cdot, 1)$ is convex and
bounded from below, $\ell'(w, 1)$ must tend to 0 as $w \to \infty$ (wherever $\ell(\cdot, 1)$ is differentiable, which
is almost everywhere by convexity). Moreover, by assumption (2), $\ell'(w, 1)$ is upper bounded
by a strictly negative constant for any $w < 0$. As a result, the gradient of $\ell(3w, 1) + \ell(-w, 1)$,
which equals $3\ell'(3w, 1) - \ell'(-w, 1)$, must be positive for large enough $w > 0$, and negative for
$w < 0$ of large enough magnitude, so the minimizers of $\ell(3w, 1) + \ell(-w, 1)$ are in some bounded subset of $\mathbb{R}$.

3. There is some $s \in \mathbb{R}$ such that $\ell(\cdot, 1)$ monotonically decreases in $(-\infty, s)$ and monotonically
increases in $(s, \infty)$. If the function is constant in $(s, \infty)$ or in $(-\infty, s)$, we are back to one of the
two previous cases. Otherwise, by convexity of $\ell(\cdot)$, we must have some $a, b$, $a \le s \le b$, such that
$\ell(\cdot, 1)$ is strictly decreasing at $(-\infty, a)$, and strictly increasing at $(b, \infty)$. In that case, it is not
hard to see that $\ell(3w, 1) + \ell(-w, 1)$ must be strictly increasing for any $w > \max\{|a|, |b|\}$, and
strictly decreasing for any $w < -\max\{|a|, |b|\}$. So again, the minimizers of $\ell(3w, 1) + \ell(-w, 1)$
are in some bounded subset of $\mathbb{R}$.

We are now ready to show that the two sets in (7) must be disjoint. Suppose we pick $B_{\mathbf{w}}$ large
enough so that the first set in (7) is strictly inside $\{w : |w|^2 \le B_{\mathbf{w}}\}$. Assume on the contrary
that there is some $w$, $|w|^2 < B_{\mathbf{w}}$, which belongs to both sets in (7). By assumption (2) and the
fact that $w$ minimizes $\ell(w, 1)$, we may assume $w > 0$. Therefore, $0 \in \partial\ell(w, 1)$ as well as $0 \in
\partial(\ell(3w, 1) + \ell(-w, 1))$, where $\partial f$ is the (closed and convex) subgradient set of a convex function
$f$. By subgradient calculus, this means there is some $a/3 \in \partial\ell(3w, 1)$ and $b \in \partial\ell(-w, 1)$ such that
$a/3 - b = 0$. This implies that $\partial\ell(3w, 1) \cap \partial\ell(-w, 1) \ne \emptyset$. Now, suppose that $\max \partial\ell(-w, 1) < 0$.
This would mean that $\min \partial\ell(3w, 1) < 0$. But then $\ell(\cdot, 1)$ is strictly decreasing at $(w, 3w)$, and in
particular $\ell(w, 1) > \ell(3w, 1)$, contradicting the assumption that $w$ minimizes $\ell(\cdot, 1)$. So we must
have $\max \partial\ell(-w, 1) \ge 0$. Moreover, $\min \partial\ell(-w, 1) \le 0$ (because $w$ minimizes $\ell(\cdot, 1)$ and $-w < w$).
Since the subgradient set is closed and convex, it follows that $0 \in \partial\ell(-w, 1)$. Therefore, both $w$ and
$-w$ minimize $\ell(\cdot, 1)$. But this means that $\ell'(0, 1) = 0$, in contradiction to assumption (2).

F Proof of Thm. 5

Let $\mathcal{D}$ be a distribution which satisfies (3). The idea of the proof is that the learner cannot know
if $\mathcal{D}$ is the real distribution (on which regret is measured) or the distribution which includes noise.
Specifically, consider the following two adversary strategies:

1. At each round, draw an example from $\mathcal{D}$, and present it to the learner (with the label 1) without
adding noise.

2. At each round, pick the example $\mathbb{E}_{\mathcal{D}}[\mathbf{x}]$, add to it zero-mean noise sampled from $Z - \mathbb{E}_{\mathcal{D}}[\mathbf{x}]$,
where $Z$ is distributed according to $\mathcal{D}$, and present the noisy example (with the label 1) to the
learner.

In both cases the examples presented to a learner appear to come from the same distribution $\mathcal{D}$.
Hence, any learner observing one copy of each example cannot know which of the two strategies is
played by the adversary. Since (3) implies that the sets of optimal learner strategies for each of the
two adversary strategies are disjoint, by picking an appropriate strategy the adversary can force a
constant regret.

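To formalize this argument, fix any learning algorithm that observes one copy of each example
and let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence of generated predictors. Then it is sufficient to show that at
least one of the following two holds

$$\limsup_{T\to\infty} \max_{\mathbf{w}\in\mathcal{W}} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, 1) - \ell(\langle \mathbf{w}, \mathbf{x}_t\rangle, 1)\right] > 0 \qquad (8)$$

$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbb{E}[\mathbf{x}]\rangle, 1) - \min_{\mathbf{w}\in\mathcal{W}} \ell(\langle \mathbf{w}, \mathbb{E}[\mathbf{x}]\rangle, 1) > 0 \quad \text{w.p. 1} \qquad (9)$$

where in both cases the expectation is with respect to $\mathcal{D}$ and "w.p. 1" refers to the randomness of
the noise. First note that (8) is implied by

$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, 1) - \min_{\mathbf{w}\in\mathcal{W}} \mathbb{E}\left[\ell(\langle \mathbf{w}, \mathbf{x}\rangle, 1)\right] > 0 \quad \text{w.p. 1.} \qquad (10)$$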



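Since $\mathcal{W}$ is compact, $\mathcal{D}$ is assumed to be supported on a compact subset, and $\ell$ is convex and hence
continuous, $\ell(\langle \mathbf{w}, \mathbf{x}\rangle, 1)$ is almost surely bounded. So by Azuma's inequality

$$\sum_{T=1}^{\infty} P\left( \frac{1}{T}\sum_{t=1}^{T} \left( \mathbb{E}_t\left[\ell(\langle \mathbf{w}_t, \mathbf{x}\rangle, 1)\right] - \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, 1) \right) \ge \epsilon \right) < \infty$$

where the expectation $\mathbb{E}_t[\cdot]$ is conditioned on the randomness in the previous rounds. Letting
$\bar{\mathbf{w}}_T = \frac{1}{T}\sum_{t=1}^{T} \mathbf{w}_t$ (which belongs to $\mathcal{W}$ for all $T$ since it is a convex set), we have

$$\frac{1}{T}\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbf{x}_t\rangle, 1) \ge \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_t\left[\ell(\langle \mathbf{w}_t, \mathbf{x}\rangle, 1)\right] \ge \mathbb{E}\left[\ell(\langle \bar{\mathbf{w}}_T, \mathbf{x}\rangle, 1)\right]$$

where the first inequality holds with probability 1 as $T \to \infty$ by the Borel-Cantelli lemma, and the
second one holds for every $T$ because $\ell$ is convex.
Similarly,

$$\frac{1}{T}\sum_{t=1}^{T} \ell(\langle \mathbf{w}_t, \mathbb{E}[\mathbf{x}]\rangle, 1) \ge \ell(\langle \bar{\mathbf{w}}_T, \mathbb{E}[\mathbf{x}]\rangle, 1).$$

Hence (9)-(10) are obtained if we show that no single sequence of predictors $\bar{\mathbf{w}}_1, \bar{\mathbf{w}}_2, \ldots$ simultaneously
satisfies

$$\limsup_{T\to\infty} F_1(\bar{\mathbf{w}}_T) \le 0 \quad\text{and}\quad \limsup_{T\to\infty} F_2(\bar{\mathbf{w}}_T) \le 0 \qquad (11)$$

where

$$F_1(\bar{\mathbf{w}}_T) = \mathbb{E}\left[\ell(\langle \bar{\mathbf{w}}_T, \mathbf{x}\rangle, 1)\right] - \min_{\mathbf{w}\in\mathcal{W}} \mathbb{E}\left[\ell(\langle \mathbf{w}, \mathbf{x}\rangle, 1)\right], \qquad F_2(\bar{\mathbf{w}}_T) = \ell(\langle \bar{\mathbf{w}}_T, \mathbb{E}[\mathbf{x}]\rangle, 1) - \min_{\mathbf{w}\in\mathcal{W}} \ell(\langle \mathbf{w}, \mathbb{E}[\mathbf{x}]\rangle, 1).$$

Suppose on the contrary that there was such a sequence. Since $\bar{\mathbf{w}}_T \in \mathcal{W}$ for all $T$, and $\mathcal{W}$ is
compact, the sequence $\bar{\mathbf{w}}_1, \bar{\mathbf{w}}_2, \ldots$ has at least one cluster point $\bar{\mathbf{w}} \in \mathcal{W}$. Moreover, it is easy to verify
that the functions $F_1$ and $F_2$ are continuous. Indeed, $\ell(\langle \cdot, \mathbb{E}[\mathbf{x}]\rangle, 1)$ is continuous by convexity of $\ell$ and
$\mathbb{E}[\ell(\langle \cdot, \mathbf{x}\rangle, 1)]$ is continuous by the compactness assumptions. Hence, for any cluster point $\bar{\mathbf{w}}$ of $\bar{\mathbf{w}}_1, \bar{\mathbf{w}}_2, \ldots$,
the values $F_1(\bar{\mathbf{w}})$ and $F_2(\bar{\mathbf{w}})$ are cluster points of the sequences $F_1(\bar{\mathbf{w}}_T)$ and $F_2(\bar{\mathbf{w}}_T)$. Since $F_1, F_2 \ge 0$ by construction, and we are assuming
that neither $F_1(\bar{\mathbf{w}}) > 0$ nor $F_2(\bar{\mathbf{w}}) > 0$ for any cluster point $\bar{\mathbf{w}}$, we must have $F_1(\bar{\mathbf{w}}) = F_2(\bar{\mathbf{w}}) = 0$.
But this means that $\bar{\mathbf{w}}$ belongs to both sets appearing in (3), in contradiction to the assumption
they are disjoint. Thus, no sequence of predictors satisfies (11), as desired.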



